์ผ | ์ | ํ | ์ | ๋ชฉ | ๊ธ | ํ |
---|---|---|---|---|---|---|
1 | 2 | |||||
3 | 4 | 5 | 6 | 7 | 8 | 9 |
10 | 11 | 12 | 13 | 14 | 15 | 16 |
17 | 18 | 19 | 20 | 21 | 22 | 23 |
24 | 25 | 26 | 27 | 28 | 29 | 30 |
- selenium
- OS
- ์นํฌ๋กค๋ง
- Java
- ์ผ์ฑ๊ธฐ์ถ
- BFS
- ์ด์์ฒด์
- ๋งคํธ๋ฉ
- ๋จ์ํ ์คํธ
- JPA
- Crawling
- ์ด์งํ์
- ์๊ณ ๋ฆฌ์ฆ
- ๋ฐฑ์ค
- python
- API
- ToyProject
- ์์์ฒ๋ฆฌ
- ์ ๋ ๋์
- ํ์ด์ฌ
- ์นด์นด์ค
- ํ๋ก๊ทธ๋๋จธ์ค
- matlab
- webcrawling
- ์นด์นด์ค๊ธฐ์ถ
- ๊ทธ๋ฆฌ๋
- spring
- ํด์
- ํด์๋ฒ
- ํฌ๋กค๋ง
DevKim
[ Webtooniverse ] Maria DB ๊ตฌ์ถํ๊ธฐ - ๋ค์ด๋ฒ ์นํฐ ํฌ๋กค๋ง ๋ณธ๋ฌธ
[ Webtooniverse ] Maria DB ๊ตฌ์ถํ๊ธฐ - ๋ค์ด๋ฒ ์นํฐ ํฌ๋กค๋ง
on_doing 2021. 7. 28. 22:21๐ Webtooniverse ์ ํต์ฌ์ธ ์นํฐ ๋ฐ์ดํฐ๋ฅผ ์์งํด๋ณด์ ๐
[ MySql ์ด ์๋, Maria DB๋ฅผ ์ ํํ ์ด์ ]
- ํ์๊ฐ์์ผ๋ฉด ํธํ๊ณ ์ต์ํ Mysql์ ๊ณจ๋์ ํ ์ง๋ง,
'Mysql์์ Maria DB๋ก ๋ง์ด๊ทธ์๋ฆฌ์ ํด์ผ ํ 10๊ฐ์ง ์ด์ '๋ผ๋ ๊ธ์ ์ฝ๊ณ
์ด๋ฒ ๊ธฐํ์ MariaDB๋ฅผ ์ ํํด์ ํ๋ก์ ํธ๋ฅผ ์งํํด๋ณด๊ณ ์ถ์๋ค.
์ฌ์ค Maria DB๊ฐ Mysql๋ณด๋ค ๋ ๋์ ์ฑ๋ฅ์ ์ ๊ณตํ๋ค๊ณ ๋ํ์ง๋ง, ์ด๋ฒ ํ๋ก์ ํธ ์์ฒด๊ฐ ๋๊ฐ์ ์ฑ๋ฅ์ ํ์ฐํ๊ฒ ๋น๊ตํ ์ ์์ ๋งํผ์ ํฐ ํ๋ก์ ํธ๋ ์๋๊ธฐ์ ํฐ ์ฐจ์ด๊ฐ ์์ ๊ฒ ๊ฐ๊ธดํ๋ค.
[ ํ์ํ ๋ฐ์ดํฐ ]
1. ์์ผ๋ณ ์นํฐ
- ์นํฐ ๊ด๋ จ ์๋น์ค์๋ ์ง๊ธ ์ฐ์ฌ๋๊ณ ์๋ ์นํฐ์ด ํ์์ ์ด๊ธฐ์,
๊ฐ์ฅ ๋จผ์ ์์ผ๋ณ ์นํฐ์ DB์ ์ ์ฅํ๊ธฐ๋กํ๋ค.
์์ผ๋ณ ์นํฐ๋ง 400๊ฐ๊ฐ ๋๊ธฐ ๋๋ฌธ์, ์ ๋ ๋์์ ์ฌ์ฉํ์ฌ ์๋ํ ์์คํ ์ ๊ตฌํํ์๋ค.
* ์์ผ๋ณ ์นํฐ ์ด์์ฌํญ *
1. ์ค๋์ ์นํฐ
- ์ค๋์ ์นํฐ์ด ๊ทธ๋ ๊ทธ๋ ๋ณํ๊ธฐ ๋๋ฌธ์,
์ค๋์ ํด๋นํ๋ ์นํฐ์ ํด๋ฆญํ๋ ํํฌ์, ๊ทธ๋ ์ง ์์ ์์ผ์ ํด๋ฆญํ๋ ํ๊ทธ๊ฐ ๋ค๋ฅด๋ค.
์ค๋ ํฌ๋กค๋ง์ ๋ค ๋ชปํ๋ฉด ๋ด์ผ๋ก ๋์ด๊ฐ์ผํ๊ธฐ ๋๋ฌธ์, ํธ์์ฑ์ ์ํด 2๊ฐ์ ๋ฉ์๋๋ก ๋ถ๋ฆฌํด์ฃผ์๋ค.
2. ์นํฐ ์ ๋ชฉ๊ณผ ์๊ฐ
- ์นํฐ์ ๋ชฉ๊ณผ ์๊ฐ ์ด๋ฆ์ด h2์ ๊ฐ์ด ๋ฌถ์ฌ์๊ธฐ ๋๋ฌธ์,
์นํฐ ์ ๋ชฉ๋ง ๋ฐ๋ก ๊ฐ์ ธ์ค๋ ค๋ฉด ํ๋ฒ๋ ๋ฌธ์์ด ์ฒ๋ฆฌ๋ฅผ ํด์ฃผ์ด์ผํ๋ค.
[ ์ฝ๋ ]
toonTitle = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > h2').text
tempList = toonTitle.split(' ')
tempList = tempList[:-1]
if tempList[-1] == '/':
tempList = tempList[:-2]
toonTitle = ' '.join(tempList)
์นด์นด์ค(์ ๋ค์) ์นํฐ๊ณผ ๋ค์ด๋ฒ ์นํฐ์ ๋ถ๋ฆฌํ๊ณ , ์๊ฒฐ๋ ์นํฐ๊ณผ ์ฐ์ฌ์ค์ธ ์นํฐ์ ๋ถ๋ฆฌํด์ฃผ๊ธฐ ์ํด
Column์ ์นํฐ ํ๋ซํผ๊ณผ ์๊ฒฐ ์ฌ๋ถ๋ฅผ ์ถ๊ฐํด์ฃผ์๋ค.
[ ์ฝ๋ ]
from selenium import webdriver
import pandas as pd
import time
import pymysql
conn = pymysql.connect(db ์ ๋ณด)
cur = conn.cursor()
# Genre Table
sql = "CREATE TABLE GENRE(genreId BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,genreType VARCHAR(45) NOT NULL);"
cur.execute(sql)
sql = "ALTER TABLE GENRE CONVERT TO CHARACTER SET utf8;"
cur.execute(sql)
genre_sql = "INSERT INTO GENRE (genreType) VALUES (%s);"
values = ("์คํฌ์ธ ")
cur.execute(genre_sql, values)
conn.commit()
# Webtoon table
sql = "CREATE TABLE WEBTOON(toonId BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,toonTitle VARCHAR(255) NOT NULL ,toonAuthor VARCHAR(45) NOT NULL,toonContent VARCHAR(4000) NOT NULL,toonImg VARCHAR(255) NOT NULL,toonWeekday VARCHAR(45),realUrl VARCHAR(255) NOT NULL,toonAge VARCHAR(45),toonFlatform VARCHAR(45) NOT NULL,finished boolean);"
cur.execute(sql)
sql = "ALTER TABLE WEBTOON CONVERT TO CHARACTER SET utf8;"
cur.execute(sql)
#sql = "ALTER TABLE WEBTOON MODIFY toonContent VARCHAR(4000)"
#cur.execute(sql)
# Webtoon_Genre Table
sql = "CREATE TABLE WEBTOON_GENRE(toonGenreId BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,toonId BIGINT NOT NULL ,genreId BIGINT NOT NULL);"
cur.execute(sql)
driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://comic.naver.com/webtoon/weekday')
time.sleep(4)
data = []
def weekly_crawl(day_name, num, start, end):
global toon_num
for i in range(start, end + 1):
c = driver.find_element_by_css_selector(
f'#content > div.list_area.daily_all > div:nth-child({num}) > div > ul > li:nth-child({i}) > div > a > span')
c.click()
time.sleep(3)
toonTitle = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > h2').text
tempList = toonTitle.split(' ')
tempList = tempList[:-1]
if tempList[-1] == '/':
tempList = tempList[:-2]
toonTitle = ' '.join(tempList)
toonAuthor = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > h2 > span').text
toonContent = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > p:nth-child(2)').text
toonImg = driver.find_element_by_css_selector('#content > div.comicinfo > div.thumb > a > img').get_attribute(
'src')
toonWeekday = day_name
realUrl = driver.current_url
toonAge = driver.find_element_by_css_selector(
'#content > div.comicinfo > div.detail > p.detail_info > span.age').text
toonFlatform = "๋ค์ด๋ฒ"
finished = False
# ์ฅ๋ฅด-์นํฐ ์ฐ๊ด๊ด๊ณ ์ค์
genre = driver.find_element_by_css_selector(
'#content > div.comicinfo > div.detail > p.detail_info > span.genre').text.strip()
genre_List = genre.split(',')
for gl in genre_List:
gl = gl.strip()
genreId = dic[gl]
sql = "INSERT INTO WEBTOON_GENRE(toonId,genreId) VALUES (%s, %s)"
val = (toon_num, genreId)
cur.execute(sql, val)
conn.commit()
time.sleep(5)
# toonId +1
toon_num += 1
data.append(
[toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished])
print([toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished])
sql = "INSERT INTO WEBTOON (toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)"
val = (toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished)
cur.execute(sql, val)
conn.commit()
time.sleep(5)
driver.back()
time.sleep(2)
์ฝ๋ ๋๋ ค๋๊ณ ์ ์ ๋ค์๋๋ฐ,
์ผ์ด๋์ ํ์ธํด๋ณด๋ toonContent(์นํฐ ์ค๋ช ๊ธ)์ด ๋๋ฌด ๊ธธ๋ค๊ณ ์ค๋ฅ๊ฐ ๋๋ฉด์ ๋ฉ์ถฐ์์๋ค.
๋ค์๋ ๊ธธ๋ค๊ณ ๋ชปํ๊ฒ,,์ฒ์์ VARCHAR(255)๋ก ์ค์ ํด๋์ ๊ฒ์ VARCHAR(4000)์ผ๋ก ๋ํญ ๋๋ ค์คฌ๋ค.
WEBTOON table์ ํ์ธํด๋ณด๋ ์ ๋ชฉ๊ณผ ์๊ฐ๋ช ๋ ์ ๋ถ๋ฆฌ๋์ด ๋ฝ์์์ก๋ค.
2. ์๊ฒฐ ์นํฐ
- ์๊ฒฐ ์นํฐ๋ง ๊ฑฐ์ 1000๊ฐ์ฏค ๋๋ ๊ฒ ๊ฐ์์,
์ธ๊ธฐ์์ผ๋ก ๋์ด ํ ๋ค, 33์ค๋ง ๊ฐ์ ธ์ค๊ธฐ๋กํ๋ค. (99๊ฐ)
[ ์ฝ๋ ]
def finished_webtoon():
# global toon_num
c = driver.find_element_by_css_selector('#submenu > ul > li:nth-child(7) > a')
c.click()
time.sleep(3)
# ์ธ๊ธฐ์ ํด๋ฆญ
c1 = driver.find_element_by_css_selector('#content > div.view_type > ul.sortby > li:nth-child(1) > a')
c1.click()
time.sleep(5)
for i in range(1, 100):
c2 = driver.find_element_by_css_selector(f'#content > div.list_area > ul > li:nth-child({i}) > dl > dt > a')
c2.click()
time.sleep(3)
toonTitle = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > h2').text
tempList = toonTitle.split(' ')
tempList = tempList[:-1]
if tempList[-1] == '/':
tempList = tempList[:-2]
toonTitle = ' '.join(tempList)
toonAuthor = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > h2 > span').text
toonContent = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > p:nth-child(2)').text
toonImg = driver.find_element_by_css_selector('#content > div.comicinfo > div.thumb > a > img').get_attribute(
'src')
toonWeekday = None
realUrl = driver.current_url
try:
toonAge = driver.find_element_by_css_selector(
'#content > div.comicinfo > div.detail > p.detail_info > span.age').text
except:
toonAge = None
toonFlatform = "๋ค์ด๋ฒ"
finished = True
# ์ฅ๋ฅด-์นํฐ ์ฐ๊ด๊ด๊ณ ์ค์
genre = driver.find_element_by_css_selector(
'#content > div.comicinfo > div.detail > p.detail_info > span.genre').text.strip()
genre_List = genre.split(',')
for gl in genre_List:
gl = gl.strip()
genreId = dic[gl]
sql = "INSERT INTO WEBTOON_GENRE(toonId,genreId) VALUES (%s, %s)"
val = (toon_num, genreId)
cur.execute(sql, val)
conn.commit()
time.sleep(5)
# toonId +1
toon_num += 1
print([toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished])
data.append(
[toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished])
sql = "INSERT INTO WEBTOON (toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)"
val = (toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished)
cur.execute(sql, val)
conn.commit()
driver.back()
time.sleep(3)
Q. [ ์ฅ๋ฅด_์นํฐ Join Table์ DB์ ์ด๋ป๊ฒ ์ ์ฅํ ๊ฒ์ธ๊ฐ? ]
Entity ์ค๊ณํ ๊ฒ์ ๋ณด๋ฉด ์๊ฒ ์ง๋ง, webtoon์๋ ์ฅ๋ฅด ์ ๋ณด๊ฐ ๋ฐ๋ก ์ปฌ๋ผ์ผ๋ก ๋ค์ด๊ฐ์์ง ์๊ณ ,
๋ค:๋ค ๊ด๊ณ๋ฅผ ํ์ด๋ด๊ธฐ ์ํด WebtoonGenre ๋ผ๋ ์ฐ๊ฒฐ ํ ์ด๋ธ์ ํ๋ ๋์๋ค.
๋ฌธ์ ๋ ํฌ๋กค๋งํ ์ฅ๋ฅด๋ "๋ก๋งจ์ค,์ผ์,ํ์" ๊ฐ์ด ํ๋์ String์ผ๋ก ๋ค์ด๊ฐ๊ฒ ๋๋ค๋ ์ ์ด๊ณ ,
์กฐ์ธ ํ ์ด๋ธ์๋ Webtoon๊ณผ Genre์ FK๊ฐ ๋ค์ด๊ฐ๋ค๋ ๊ฒ์ด์๋ค.
[ ํด๊ฒฐ๋ฐฉ๋ฒ ]
-์ด๊ฒ ๋ง๋ ๋ฐฉ๋ฒ์ธ์ง๋ ๋ชจ๋ฅด๊ฒ ์ง๋ง,
์์งํ ์นํฐ ๋ฐ์ดํฐ์ ์ฅ๋ฅด์ ๋งคํ์ ์ง์ ํด์ค์ผํ๊ธฐ ๋๋ฌธ์
PK๋ฅผ AUTO_INCREMENT๋ก ์ค์ ํด๋ ์ ์ ์ด์ฉํ์ฌ, ์ ์ญ ๋ณ์๋ก 1์ฉ ๋ค์ด๋๊ฒ ํด์ฃผ์๊ณ
๋์ ๋๋ฆฌ๋ฅผ ์ด์ฉํ์ฌ ๊ฐ๊ฐ์ String์ int ํํ๋ก ๋ณ๊ฒฝํด์ฃผ์๋ค.
๋ง๋ค๊ณ ์ํ๋ ํ ์ด๋ธ์ ๋ค์๊ณผ ๊ฐ์ ํํ์ด๋ค.
[ ์ฝ๋ ]
toon_num = 1
dic = {
"์ํผ์๋": 1,
"์ด๋๋ฒ์ค": 2,
"์คํ ๋ฆฌ": 3,
"์ผ์": 4,
"๊ฐ๊ทธ": 5,
"ํํ์ง": 6,
"์ก์
": 7,
"๋๋ผ๋ง": 8,
"๋ก๋งจ์ค": 9,
"๊ฐ์ฑ": 10,
"์ค๋ฆด๋ฌ": 11,
"์๋๊ทน": 12,
"์คํฌ์ธ ": 13
}
# ์ฅ๋ฅด-์นํฐ ์ฐ๊ด๊ด๊ณ ์ค์
genre = driver.find_element_by_css_selector(
'#content > div.comicinfo > div.detail > p.detail_info > span.genre').text.strip()
genre_List = genre.split(',')
for gl in genre_List:
gl = gl.strip()
genreId = dic[gl]
sql = "INSERT INTO WEBTOON_GENRE(toonId,genreId) VALUES (%s, %s)"
val = (toon_num, genreId)
cur.execute(sql, val)
conn.commit()
time.sleep(5)
# toonId +1
toon_num += 1
[ ๊ฒฐ๊ณผ ]
์ด 536๊ฐ์ ์นํฐ ์ํ๋ค์ด 2๊ฐ์ฉ ์ฅ๋ฅด๋ฅผ ๊ฐ์ง๊ณ ์๊ธฐ ๋๋ฌธ์,
1072๊ฐ์ ๋ฐ์ดํฐ๊ฐ ์ ์์ฑ์ด ๋ ๊ฒ์ ํ์ธํ ์ ์๋ค.