DevKim

[ Webtooniverse ] Maria DB ๊ตฌ์ถ•ํ•˜๊ธฐ - ๋„ค์ด๋ฒ„ ์›นํˆฐ ํฌ๋กค๋ง ๋ณธ๋ฌธ

Spring Project/Webtooniverse

[ Webtooniverse ] Maria DB ๊ตฌ์ถ•ํ•˜๊ธฐ - ๋„ค์ด๋ฒ„ ์›นํˆฐ ํฌ๋กค๋ง

on_doing 2021. 7. 28. 22:21
728x90

๐Ÿƒ Webtooniverse ์˜ ํ•ต์‹ฌ์ธ ์›นํˆฐ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•ด๋ณด์ž ๐Ÿƒ


[ MySql ์ด ์•„๋‹Œ, Maria DB๋ฅผ ์„ ํƒํ•œ ์ด์œ  ]

- ํ‰์†Œ๊ฐ™์•˜์œผ๋ฉด ํŽธํ•˜๊ณ  ์ต์ˆ™ํ•œ Mysql์„ ๊ณจ๋ž์„ ํ…Œ์ง€๋งŒ,

'Mysql์—์„œ Maria DB๋กœ ๋งˆ์ด๊ทธ์—๋ฆฌ์…˜ ํ•ด์•ผ ํ•  10๊ฐ€์ง€ ์ด์œ '๋ผ๋Š” ๊ธ€์„ ์ฝ๊ณ 

์ด๋ฒˆ ๊ธฐํšŒ์— MariaDB๋ฅผ ์„ ํƒํ•ด์„œ ํ”„๋กœ์ ํŠธ๋ฅผ ์ง„ํ–‰ํ•ด๋ณด๊ณ  ์‹ถ์—ˆ๋‹ค.

์‚ฌ์‹ค Maria DB๊ฐ€ Mysql๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•œ๋‹ค๊ณ ๋Š”ํ•˜์ง€๋งŒ, ์ด๋ฒˆ ํ”„๋กœ์ ํŠธ ์ž์ฒด๊ฐ€ ๋‘๊ฐœ์˜ ์„ฑ๋Šฅ์„ ํ™•์—ฐํ•˜๊ฒŒ ๋น„๊ตํ•  ์ˆ˜ ์žˆ์„ ๋งŒํผ์˜ ํฐ ํ”„๋กœ์ ํŠธ๋Š” ์•„๋‹ˆ๊ธฐ์— ํฐ ์ฐจ์ด๊ฐ€ ์—†์„ ๊ฒƒ ๊ฐ™๊ธดํ•˜๋‹ค.


[ ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ]

1. ์š”์ผ๋ณ„ ์›นํˆฐ

- ์›นํˆฐ ๊ด€๋ จ ์„œ๋น„์Šค์—๋Š” ์ง€๊ธˆ ์—ฐ์žฌ๋˜๊ณ  ์žˆ๋Š” ์›นํˆฐ์ด ํ•„์ˆ˜์ ์ด๊ธฐ์—,

๊ฐ€์žฅ ๋จผ์ € ์š”์ผ๋ณ„ ์›นํˆฐ์„ DB์— ์ €์žฅํ•˜๊ธฐ๋กœํ–ˆ๋‹ค.

 

์š”์ผ๋ณ„ ์›นํˆฐ๋งŒ 400๊ฐœ๊ฐ€ ๋„˜๊ธฐ ๋•Œ๋ฌธ์—, ์…€๋ ˆ๋‹ˆ์›€์„ ์‚ฌ์šฉํ•˜์—ฌ ์ž๋™ํ™” ์‹œ์Šคํ…œ์„ ๊ตฌํ˜„ํ•˜์˜€๋‹ค.

 

* ์š”์ผ๋ณ„ ์›นํˆฐ ์ด์Šˆ์‚ฌํ•ญ *

1. ์˜ค๋Š˜์˜ ์›นํˆฐ

- ์˜ค๋Š˜์˜ ์›นํˆฐ์ด ๊ทธ๋‚  ๊ทธ๋‚  ๋ณ€ํ•˜๊ธฐ ๋•Œ๋ฌธ์—,

์˜ค๋Š˜์— ํ•ด๋‹นํ•˜๋Š” ์›นํˆฐ์„ ํด๋ฆญํ•˜๋Š” ํƒœํฌ์™€, ๊ทธ๋ ‡์ง€ ์•Š์€ ์š”์ผ์„ ํด๋ฆญํ•˜๋Š” ํƒœ๊ทธ๊ฐ€ ๋‹ค๋ฅด๋‹ค. 

์˜ค๋Š˜ ํฌ๋กค๋ง์„ ๋‹ค ๋ชปํ•˜๋ฉด ๋‚ด์ผ๋กœ ๋„˜์–ด๊ฐ€์•ผํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ํŽธ์˜์„ฑ์„ ์œ„ํ•ด 2๊ฐœ์˜ ๋ฉ”์†Œ๋“œ๋กœ ๋ถ„๋ฆฌํ•ด์ฃผ์—ˆ๋‹ค.

 

2. ์›นํˆฐ ์ œ๋ชฉ๊ณผ ์ž‘๊ฐ€

- ์›นํˆฐ์ œ๋ชฉ๊ณผ ์ž‘๊ฐ€ ์ด๋ฆ„์ด h2์— ๊ฐ™์ด ๋ฌถ์—ฌ์žˆ๊ธฐ ๋•Œ๋ฌธ์—,

์›นํˆฐ ์ œ๋ชฉ๋งŒ ๋”ฐ๋กœ ๊ฐ€์ ธ์˜ค๋ ค๋ฉด ํ•œ๋ฒˆ๋” ๋ฌธ์ž์—ด ์ฒ˜๋ฆฌ๋ฅผ ํ•ด์ฃผ์–ด์•ผํ•œ๋‹ค.

 

[ ์ฝ”๋“œ ]

toonTitle = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > h2').text
tempList = toonTitle.split(' ')
tempList = tempList[:-1]

if tempList[-1] == '/':
	tempList = tempList[:-2]

toonTitle = ' '.join(tempList)

์นด์นด์˜ค(์ „ ๋‹ค์Œ) ์›นํˆฐ๊ณผ ๋„ค์ด๋ฒ„ ์›นํˆฐ์„ ๋ถ„๋ฆฌํ•˜๊ณ , ์™„๊ฒฐ๋œ ์›นํˆฐ๊ณผ ์—ฐ์žฌ์ค‘์ธ ์›นํˆฐ์„ ๋ถ„๋ฆฌํ•ด์ฃผ๊ธฐ ์œ„ํ•ด

Column์— ์›นํˆฐ ํ”Œ๋žซํผ๊ณผ ์™„๊ฒฐ ์—ฌ๋ถ€๋ฅผ ์ถ”๊ฐ€ํ•ด์ฃผ์—ˆ๋‹ค.

 

[ ์ฝ”๋“œ ]

from selenium import webdriver
import pandas as pd
import time
import pymysql

conn = pymysql.connect(db ์ •๋ณด)
cur = conn.cursor()

# Genre Table
sql = "CREATE TABLE GENRE(genreId BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,genreType VARCHAR(45) NOT NULL);"
cur.execute(sql)
sql = "ALTER TABLE GENRE CONVERT TO CHARACTER SET utf8;"
cur.execute(sql)

genre_sql = "INSERT INTO GENRE (genreType) VALUES (%s);"
values = ("์Šคํฌ์ธ ")
cur.execute(genre_sql, values)
conn.commit()

# Webtoon table
sql = "CREATE TABLE WEBTOON(toonId BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,toonTitle VARCHAR(255) NOT NULL ,toonAuthor VARCHAR(45) NOT NULL,toonContent VARCHAR(4000) NOT NULL,toonImg VARCHAR(255) NOT NULL,toonWeekday VARCHAR(45),realUrl VARCHAR(255) NOT NULL,toonAge VARCHAR(45),toonFlatform VARCHAR(45) NOT NULL,finished boolean);"
cur.execute(sql)
sql = "ALTER TABLE WEBTOON CONVERT TO CHARACTER SET utf8;"
cur.execute(sql)

#sql = "ALTER TABLE WEBTOON MODIFY toonContent VARCHAR(4000)"
#cur.execute(sql)

# Webtoon_Genre Table
sql = "CREATE TABLE WEBTOON_GENRE(toonGenreId BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,toonId BIGINT NOT NULL ,genreId BIGINT NOT NULL);"
cur.execute(sql)

driver = webdriver.Chrome('chromedriver.exe')

driver.get('https://comic.naver.com/webtoon/weekday')
time.sleep(4)

data = []

def weekly_crawl(day_name, num, start, end):
    global toon_num
    for i in range(start, end + 1):
        c = driver.find_element_by_css_selector(
            f'#content > div.list_area.daily_all > div:nth-child({num}) > div > ul > li:nth-child({i}) > div > a > span')
        c.click()
        time.sleep(3)

        toonTitle = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > h2').text
        tempList = toonTitle.split(' ')
        tempList = tempList[:-1]

        if tempList[-1] == '/':
            tempList = tempList[:-2]

        toonTitle = ' '.join(tempList)
        toonAuthor = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > h2 > span').text
        toonContent = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > p:nth-child(2)').text
        toonImg = driver.find_element_by_css_selector('#content > div.comicinfo > div.thumb > a > img').get_attribute(
            'src')
        toonWeekday = day_name
        realUrl = driver.current_url
        toonAge = driver.find_element_by_css_selector(
            '#content > div.comicinfo > div.detail > p.detail_info > span.age').text
        toonFlatform = "๋„ค์ด๋ฒ„"
        finished = False

        # ์žฅ๋ฅด-์›นํˆฐ ์—ฐ๊ด€๊ด€๊ณ„ ์„ค์ •
        genre = driver.find_element_by_css_selector(
            '#content > div.comicinfo > div.detail > p.detail_info > span.genre').text.strip()
        genre_List = genre.split(',')

        for gl in genre_List:
            gl = gl.strip()
            genreId = dic[gl]
            sql = "INSERT INTO WEBTOON_GENRE(toonId,genreId) VALUES (%s, %s)"
            val = (toon_num, genreId)
            cur.execute(sql, val)
            conn.commit()

            time.sleep(5)

        # toonId +1
        toon_num += 1

        data.append(
            [toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished])

        print([toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished])

        sql = "INSERT INTO WEBTOON (toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)"
        val = (toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished)
        cur.execute(sql, val)
        conn.commit()

        time.sleep(5)

        driver.back()

        time.sleep(2)

 

์ฝ”๋“œ ๋Œ๋ ค๋†“๊ณ  ์ž ์— ๋“ค์—ˆ๋Š”๋ฐ,

์ผ์–ด๋‚˜์„œ ํ™•์ธํ•ด๋ณด๋‹ˆ toonContent(์›นํˆฐ ์„ค๋ช…๊ธ€)์ด ๋„ˆ๋ฌด ๊ธธ๋‹ค๊ณ  ์˜ค๋ฅ˜๊ฐ€ ๋‚˜๋ฉด์„œ ๋ฉˆ์ถฐ์žˆ์—ˆ๋‹ค.

๋‹ค์‹œ๋Š” ๊ธธ๋‹ค๊ณ  ๋ชปํ•˜๊ฒŒ,,์ฒ˜์Œ์—” VARCHAR(255)๋กœ ์„ค์ •ํ•ด๋†“์€ ๊ฒƒ์„ VARCHAR(4000)์œผ๋กœ ๋Œ€ํญ ๋Š˜๋ ค์คฌ๋‹ค.

 

WEBTOON table์„ ํ™•์ธํ•ด๋ณด๋‹ˆ ์ œ๋ชฉ๊ณผ ์ž‘๊ฐ€๋ช…๋„ ์ž˜ ๋ถ„๋ฆฌ๋˜์–ด ๋ฝ‘์•„์™€์กŒ๋‹ค.


2. ์™„๊ฒฐ ์›นํˆฐ

- ์™„๊ฒฐ ์›นํˆฐ๋งŒ ๊ฑฐ์˜ 1000๊ฐœ์ฏค ๋˜๋Š” ๊ฒƒ ๊ฐ™์•„์„œ,

์ธ๊ธฐ์ˆœ์œผ๋กœ ๋‚˜์—ด ํ•œ ๋’ค, 33์ค„๋งŒ ๊ฐ€์ ธ์˜ค๊ธฐ๋กœํ–ˆ๋‹ค. (99๊ฐœ)

 

[ ์ฝ”๋“œ ]

def finished_webtoon():
    # global toon_num
    c = driver.find_element_by_css_selector('#submenu > ul > li:nth-child(7) > a')
    c.click()
    time.sleep(3)

    # ์ธ๊ธฐ์ˆœ ํด๋ฆญ
    c1 = driver.find_element_by_css_selector('#content > div.view_type > ul.sortby > li:nth-child(1) > a')
    c1.click()
    time.sleep(5)

    for i in range(1, 100):
        c2 = driver.find_element_by_css_selector(f'#content > div.list_area > ul > li:nth-child({i}) > dl > dt > a')
        c2.click()
        time.sleep(3)

        toonTitle = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > h2').text
        tempList = toonTitle.split(' ')
        tempList = tempList[:-1]

        if tempList[-1] == '/':
            tempList = tempList[:-2]

        toonTitle = ' '.join(tempList)

        toonAuthor = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > h2 > span').text
        toonContent = driver.find_element_by_css_selector('#content > div.comicinfo > div.detail > p:nth-child(2)').text
        toonImg = driver.find_element_by_css_selector('#content > div.comicinfo > div.thumb > a > img').get_attribute(
            'src')
        toonWeekday = None
        realUrl = driver.current_url

        try:
            toonAge = driver.find_element_by_css_selector(
                '#content > div.comicinfo > div.detail > p.detail_info > span.age').text
        except:
            toonAge = None

        toonFlatform = "๋„ค์ด๋ฒ„"
        finished = True

        # ์žฅ๋ฅด-์›นํˆฐ ์—ฐ๊ด€๊ด€๊ณ„ ์„ค์ •
        genre = driver.find_element_by_css_selector(
            '#content > div.comicinfo > div.detail > p.detail_info > span.genre').text.strip()
        genre_List = genre.split(',')

        for gl in genre_List:
            gl = gl.strip()
            genreId = dic[gl]
            sql = "INSERT INTO WEBTOON_GENRE(toonId,genreId) VALUES (%s, %s)"
            val = (toon_num, genreId)
            cur.execute(sql, val)
            conn.commit()

            time.sleep(5)

        # toonId +1
        toon_num += 1

        print([toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished])

        data.append(
            [toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished])

        sql = "INSERT INTO WEBTOON (toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)"
        val = (toonTitle, toonAuthor, toonContent, toonImg, toonWeekday, realUrl, toonAge, toonFlatform, finished)
        cur.execute(sql, val)
        conn.commit()

        driver.back()
        time.sleep(3)

Q. [ ์žฅ๋ฅด_์›นํˆฐ Join Table์„ DB์— ์–ด๋–ป๊ฒŒ ์ €์žฅํ•  ๊ฒƒ์ธ๊ฐ€? ]

 

Entity ์„ค๊ณ„ํ•œ ๊ฒƒ์„ ๋ณด๋ฉด ์•Œ๊ฒ ์ง€๋งŒ, webtoon์—๋Š” ์žฅ๋ฅด ์ •๋ณด๊ฐ€ ๋”ฐ๋กœ ์ปฌ๋Ÿผ์œผ๋กœ ๋“ค์–ด๊ฐ€์žˆ์ง€ ์•Š๊ณ ,

๋‹ค:๋‹ค ๊ด€๊ณ„๋ฅผ ํ’€์–ด๋‚ด๊ธฐ ์œ„ํ•ด WebtoonGenre ๋ผ๋Š” ์—ฐ๊ฒฐ ํ…Œ์ด๋ธ”์„ ํ•˜๋‚˜ ๋‘์—ˆ๋‹ค.

 

๋ฌธ์ œ๋Š” ํฌ๋กค๋งํ•œ ์žฅ๋ฅด๋Š”  "๋กœ๋งจ์Šค,์ผ์ƒ,ํ•™์›" ๊ฐ™์ด ํ•˜๋‚˜์˜ String์œผ๋กœ ๋“ค์–ด๊ฐ€๊ฒŒ ๋œ๋‹ค๋Š” ์ ์ด๊ณ ,

์กฐ์ธ ํ…Œ์ด๋ธ”์—๋Š” Webtoon๊ณผ Genre์˜ FK๊ฐ€ ๋“ค์–ด๊ฐ„๋‹ค๋Š” ๊ฒƒ์ด์—ˆ๋‹ค.

 


[ ํ•ด๊ฒฐ๋ฐฉ๋ฒ• ]

-์ด๊ฒŒ ๋งž๋Š” ๋ฐฉ๋ฒ•์ธ์ง€๋Š” ๋ชจ๋ฅด๊ฒ ์ง€๋งŒ,

์ˆ˜์ง‘ํ•œ ์›นํˆฐ ๋ฐ์ดํ„ฐ์™€ ์žฅ๋ฅด์˜ ๋งคํ•‘์„ ์ง์ ‘ํ•ด์ค˜์•ผํ–ˆ๊ธฐ ๋•Œ๋ฌธ์—

PK๋ฅผ AUTO_INCREMENT๋กœ ์„ค์ •ํ•ด๋‘” ์ ์„ ์ด์šฉํ•˜์—ฌ, ์ „์—ญ ๋ณ€์ˆ˜๋กœ 1์”ฉ ๋“ค์–ด๋‚˜๊ฒŒ ํ•ด์ฃผ์—ˆ๊ณ 

๋”•์…”๋„ˆ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ฐ๊ฐ์˜ String์„ int ํ˜•ํƒœ๋กœ ๋ณ€๊ฒฝํ•ด์ฃผ์—ˆ๋‹ค.

 

๋งŒ๋“ค๊ณ ์žํ•˜๋Š” ํ…Œ์ด๋ธ”์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ˜•ํƒœ์ด๋‹ค.

[ ์ฝ”๋“œ ]

toon_num = 1

dic = {
    "์—ํ”ผ์†Œ๋“œ": 1,
    "์˜ด๋‹ˆ๋ฒ„์Šค": 2,
    "์Šคํ† ๋ฆฌ": 3,
    "์ผ์ƒ": 4,
    "๊ฐœ๊ทธ": 5,
    "ํŒํƒ€์ง€": 6,
    "์•ก์…˜": 7,
    "๋“œ๋ผ๋งˆ": 8,
    "๋กœ๋งจ์Šค": 9,
    "๊ฐ์„ฑ": 10,
    "์Šค๋ฆด๋Ÿฌ": 11,
    "์‹œ๋Œ€๊ทน": 12,
    "์Šคํฌ์ธ ": 13
}

# ์žฅ๋ฅด-์›นํˆฐ ์—ฐ๊ด€๊ด€๊ณ„ ์„ค์ •
        genre = driver.find_element_by_css_selector(
            '#content > div.comicinfo > div.detail > p.detail_info > span.genre').text.strip()
        genre_List = genre.split(',')

        for gl in genre_List:
            gl = gl.strip()
            genreId = dic[gl]
            sql = "INSERT INTO WEBTOON_GENRE(toonId,genreId) VALUES (%s, %s)"
            val = (toon_num, genreId)
            cur.execute(sql, val)
            conn.commit()

            time.sleep(5)

        # toonId +1
        toon_num += 1

[ ๊ฒฐ๊ณผ ]

์ด 536๊ฐœ์˜ ์›นํˆฐ ์ž‘ํ’ˆ๋“ค์ด 2๊ฐœ์”ฉ ์žฅ๋ฅด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์—,

1072๊ฐœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์ž˜ ์ƒ์„ฑ์ด ๋œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

728x90
Comments