Chapter 3: Analyzing Chicago's Best Sandwich Shops

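Outline and blog progress

3-1 Learning Beautiful Soup for fetching web data
3-2 Finding the tag you want with Chrome developer tools
3-3 In practice: accessing a site that introduces Chicago's best sandwich shops
3-4 Extracting and organizing the data from the page
-------------------------------------------------------
3-5 Automatically visiting multiple web pages to collect information
3-6 tqdm, a module that makes progress bars easy in Jupyter Notebook
3-7 Visiting the 50 sandwich pages again with a progress bar
3-8 Collecting information from the 50 web pages
3-9 Marking the shop locations on a map
-------------------------------------------------------
3-10 Checking rating changes of movies based on Naver movie ratings
3-11 Checking rating changes per movie over time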

Source: 파이썬으로 데이터 주무르기 by 민형기

3-1 Learning Beautiful Soup for fetching web data

https://github.com/PinkWink/DataScience/tree/master/data

Download 03. test_first.html from the link above.

from bs4 import BeautifulSoup

# 03. test_first.html has the structure shown below.
# Because we are reading a downloaded HTML file, open it in read mode ('r').
# To view the whole page, call prettify(), which prints it with indentation.
page = open('pydata/03. test_first.html', 'r').read()
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Very Simple HTML Code by PinkWink
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    Happy PinkWink.
    <a href="http://www.pinkwink.kr" id="pw-link">
     PinkWink
    </a>
   </p>
   <p class="inner-text second-item">
    Happy Data Science.
    <a href="https://www.python.org" id="py-link">
     Python
    </a>
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    Data Science is funny.
   </b>
  </p>
  <p class="outer-text">
   <b>
    All I need is Love.
   </b>
  </p>
 </body>
</html>
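
The cell above reads the file with a bare open(...).read(), which leaves closing the file to the interpreter. A minimal sketch of the same step with a with-block and an explicit encoding follows. The temporary file and the shortened sample_html are stand-ins invented for this example so that it runs without the downloaded file.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded file, so the sketch runs anywhere.
sample_html = """<!DOCTYPE html>
<html>
<head><title>Very Simple HTML Code by PinkWink</title></head>
<body><p id="first">Happy PinkWink.</p></body>
</html>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "test_first.html"
    html_path.write_text(sample_html, encoding="utf-8")

    # The with-block closes the file even if parsing raises an error.
    with open(html_path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

title_text = soup.title.string
print(title_text)
```

Stating the encoding avoids depending on the platform default, which can differ between Windows and other systems.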

Finding body (1)

# The whole HTML document is stored in the variable soup.
# To see the nodes one level below soup, use the children attribute.
list(soup.children)

['html', '\n', <html>
 <head>
 <title>Very Simple HTML Code by PinkWink</title>
 </head>
 <body>
 <div>
 <p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>
 <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>
 </div>
 <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>
 <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>
 </body>
 </html>]

# soup holds the whole document, so the html tag is the third child (index 2).
html = list(soup.children)[2]
html

<html>
<head>
<title>Very Simple HTML Code by PinkWink</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>
<p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
                Data Science is funny.
            </b>
</p>
<p class="outer-text">
<b>
                All I need is Love.
            </b>
</p>
</body>
</html>

# Inspecting the children of html gives the following.
list(html.children)

['\n', <head>
 <title>Very Simple HTML Code by PinkWink</title>
 </head>, '\n', <body>
 <div>
 <p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>
 <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>
 </div>
 <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>
 <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>
 </body>, '\n']

The part we usually care about is the content of the body tag.

# Element 3 of html's children is the body tag.
body = list(html.children)[3]
body

<body>
<div>
<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>
<p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
                Data Science is funny.
            </b>
</p>
<p class="outer-text">
<b>
                All I need is Love.
            </b>
</p>
</body>
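
Indexing with [2] or [3] only works because of where the '\n' strings happen to fall between the tags. As a rough sketch of a less fragile approach, the children can be filtered down to real tags first. The shortened sample_html here is an assumption made for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

sample_html = """<html>
<head><title>Very Simple HTML Code by PinkWink</title></head>
<body><p>Happy PinkWink.</p></body>
</html>"""

soup = BeautifulSoup(sample_html, "html.parser")
html = soup.html

# Keep only real tags; the newline strings between them are NavigableString objects.
tag_children = [child for child in html.children if isinstance(child, Tag)]
child_names = [child.name for child in tag_children]
print(child_names)
```

With the whitespace filtered out, head is always at index 0 and body at index 1, regardless of how the source file is indented.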

Finding body (2)

# Walking through children (or parent) works, but a tag can also be reached directly by name.
soup.body

<body>
<div>
<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>
<p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
                Data Science is funny.
            </b>
</p>
<p class="outer-text">
<b>
                All I need is Love.
            </b>
</p>
</body>

Finding body (3)

# List the direct children of the body tag.
list(body.children)

['\n', <div>
 <p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>
 <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>
 </div>, '\n', <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>, '\n', <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>, '\n']

This shows the children inside the body tag as a list.

find, find_all

# When you know which tag you need, find and find_all are the usual tools.
soup.find_all('p')

[<p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>, <p class="inner-text second-item">
                 Happy Data Science.
                 <a href="https://www.python.org" id="py-link">Python</a>
 </p>, <p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>, <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>]

# find returns a single match.
# Used like this, it returns the first p tag.
soup.find('p')

<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>

# You can also find the p tags whose class is outer-text.
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>, <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>]

# Or search by the class name outer-text alone, without a tag name.
soup.find_all(class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 Data Science is funny.
             </b>
 </p>, <p class="outer-text">
 <b>
                 All I need is Love.
             </b>
 </p>]

# You can also find the tags whose id is first.
soup.find_all(id='first')

[<p class="inner-text first-item" id="first">
                 Happy PinkWink.
                 <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
 </p>]
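
Two details of the searches above are easy to miss: class_ matches a tag if any one of its classes equals the given name, and find and find_all behave differently when nothing matches. The sketch below illustrates both on a shortened copy of the sample page. The id third is a made-up value that does not exist in the document.

```python
from bs4 import BeautifulSoup

sample_html = """<body>
<p class="inner-text first-item" id="first">Happy PinkWink.</p>
<p class="outer-text first-item" id="second">Data Science is funny.</p>
<p class="outer-text">All I need is Love.</p>
</body>"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any one of a tag's classes, so both outer-text paragraphs match.
outer = soup.find_all("p", class_="outer-text")
outer_ids = [p.get("id") for p in outer]

# find() returns None when nothing matches; find_all() returns an empty list.
missing_one = soup.find("p", id="third")
missing_all = soup.find_all("p", id="third")
print(outer_ids, missing_one, len(missing_all))
```

Because find can return None, chaining an attribute access directly onto its result raises an error whenever the tag is absent, so checking the result first is the safer habit.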

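Reaching the next tag when find returns only the first match

# Calling a tag like a function is shorthand for find_all(), so this lists the tags inside head.
soup.head()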

[<title>Very Simple HTML Code by PinkWink</title>]

# next_sibling shows that a newline character follows the head tag.
soup.head.next_sibling

'\n'

# Reach the body tag, which sits at the same level as head.
soup.head.next_sibling.next_sibling

<body>
<div>
<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>
<p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
                Data Science is funny.
            </b>
</p>
<p class="outer-text">
<b>
                All I need is Love.
            </b>
</p>
</body>

# The first p tag inside body
body.p

<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>

# Applying next_sibling twice moves to the next p tag.
body.p.next_sibling.next_sibling

<p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" id="py-link">Python</a>
</p>
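
Chaining next_sibling twice is needed only because a whitespace string sits between the two p tags. A small sketch of an alternative is find_next_sibling, which skips text nodes and returns the next matching tag. The shortened sample_html is an assumption made for the example.

```python
from bs4 import BeautifulSoup

sample_html = """<div>
<p class="inner-text first-item" id="first">Happy PinkWink.</p>
<p class="inner-text second-item">Happy Data Science.</p>
</div>"""

soup = BeautifulSoup(sample_html, "html.parser")
first_p = soup.find("p")

# next_sibling returns the whitespace string between the two tags.
raw_sibling = first_p.next_sibling

# find_next_sibling skips text nodes and returns the next matching tag.
second_p = first_p.find_next_sibling("p")
print(repr(raw_sibling), second_p["class"])
```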

Getting text with get_text()

# get_text() returns only the text inside a tag.
for each_tag in soup.find_all('p'):
    print(each_tag.get_text())

                Happy PinkWink.
                PinkWink


                Happy Data Science.
                Python


                Data Science is funny.
            

                All I need is Love.

# Calling get_text() on the whole body returns all of its text as one string; the \n characters are the whitespace between the tags in the source.
body.get_text()

'\n\n\n                Happy PinkWink.\n                PinkWink\n\n\n                Happy Data Science.\n                Python\n\n\n\n\n                Data Science is funny.\n            \n\n\n\n                All I need is Love.\n            \n\n'
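
The raw result carries all of the indentation and line breaks from the source file. As a sketch of one way to tidy it, get_text accepts a separator and a strip flag. The fragment below is the first p tag from the sample page.

```python
from bs4 import BeautifulSoup

sample_html = """<p class="inner-text first-item" id="first">
                Happy PinkWink.
                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>"""

soup = BeautifulSoup(sample_html, "html.parser")
p_tag = soup.find("p")

# strip=True trims each text fragment; the first argument joins the fragments.
clean_text = p_tag.get_text(" ", strip=True)
print(clean_text)
```

Fragments that are empty after stripping are dropped, so the trailing newline inside the p tag does not leave a dangling separator.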

Finding href link addresses

# Find the a tags, which represent clickable links.
links = soup.find_all('a')
links

[<a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>,
 <a href="https://www.python.org" id="py-link">Python</a>]

# Reading the href attribute of each link gives its address.
for each in links:
    href = each['href']
    text = each.string
    print(text + '->' + href)

PinkWink->http://www.pinkwink.kr
Python->https://www.python.org
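
Indexing with each['href'] raises a KeyError if an a tag has no href attribute, which does happen on real pages. The sketch below uses get instead. The second a tag, with id no-link, is invented for this example to show the missing-attribute case.

```python
from bs4 import BeautifulSoup

sample_html = """<div>
<a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
<a id="no-link">No address</a>
</div>"""

soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for each in soup.find_all("a"):
    # each['href'] would raise KeyError on the second tag; get() returns None.
    href = each.get("href")
    if href is not None:
        pairs.append((each.get_text(strip=True), href))
print(pairs)
```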

3-2 Finding the tag you want with Chrome developer tools

https://finance.naver.com/marketindex/
Open More tools > Developer tools and click in the order shown below to find the tag for the 1,169.00 won value.
What we need is the span tag whose class is value.

from PIL import Image
Image.open('crawling1.png')

# To access a page by URL, import the urlopen function from urllib.request.
from urllib.request import urlopen

# Read the page. Printing it with prettify() is of little help here because the output is far too long to scan.
url = 'https://finance.naver.com/marketindex/'
page = urlopen(url)

soup = BeautifulSoup(page, 'html.parser')

# print(soup.prettify())

# The screenshot above showed which tag to look for, so access it as follows.
soup.find_all('span','value')[0].string

'1,171.50'

find_all returns a list, so we take its first element.
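
In find_all('span', 'value'), the second positional argument is treated as a class name, so it is the same search as class_='value'. The sketch below shows this on a hypothetical fragment. The surrounding div and the head_info and txt_krw class names are assumptions, not a copy of the live page. It also guards against an empty result, since the layout of a live page can change.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the element inspected in developer tools.
sample_html = """<div class="head_info">
<span class="value">1,171.50</span>
<span class="txt_krw">KRW</span>
</div>"""

soup = BeautifulSoup(sample_html, "html.parser")

by_position = soup.find_all("span", "value")
by_keyword = soup.find_all("span", class_="value")

# Guard against an empty result before indexing, since page layouts change.
rate_text = by_keyword[0].string if by_keyword else None
rate = float(rate_text.replace(",", "")) if rate_text else None
print(rate)
```

Removing the thousands separator before calling float turns the scraped string into a number that can be compared or plotted.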

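3-3 In practice: accessing a site that introduces Chicago's best sandwich shops

Now let's visit the Chicago Magazine website, which introduces Chicago's best sandwich shops, and collect information about them.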

https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/

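Each entry lists a menu item and a shop name. Clicking the Read more button leads to the page where Chicago Magazine reviews that shop. For now, the goal is to collect each shop's name, its signature menu item, and the link to its review page.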

Image.open('crawling1.png')

As before, use Chrome developer tools and click on BLT and Old Oak Tap. The corresponding tag is then displayed.

# Fetch the full HTML of the page.
# The URL is split into url_base and url_sub only for readability.
from bs4 import BeautifulSoup
from urllib.request import urlopen

url_base = 'https://www.chicagomag.com'
url_sub = '/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'
url = url_base + url_sub

html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

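In the screenshot above the a tag is highlighted, but the tag we actually need is probably the div above it with class sammy or sammyListing. Clicking the triangle next to a tag expands it and reveals the parts that correspond to text such as BLT, Old Oak Tap, and Read more.

# Use find_all to find the div tags whose class is sammy.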
print(soup.find_all('div','sammy')[:3])

[<div class="sammy" style="position: relative;">
<div class="sammyRank">1</div>
<div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br>
Old Oak Tap<br>
<em>Read more</em> </br></br></a></div>
</div>, <div class="sammy" style="position: relative;">
<div class="sammyRank">2</div>
<div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Au-Cheval-Fried-Bologna/"><b>Fried Bologna</b><br/>
Au Cheval<br/>
<em>Read more</em> </a></div>
</div>, <div class="sammy" style="position: relative;">
<div class="sammyRank">3</div>
<div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Xoco-Woodland-Mushroom/"><b>Woodland Mushroom</b><br/>
Xoco<br/>
<em>Read more</em> </a></div>
</div>]

The second line shows 1, the third line shows BLT, and the fourth line shows Old Oak Tap, so this looks like the information we want.

# Check the result with len. The ranking goes up to 50, so the length should be 50.
len(soup.find_all('div','sammy'))

50

# Look at the first one.
print(soup.find_all('div','sammy')[0])

<div class="sammy" style="position: relative;">
<div class="sammyRank">1</div>
<div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br>
Old Oak Tap<br>
<em>Read more</em> </br></br></a></div>
</div>

3-4 Extracting and organizing the data from the page

Now let's go through how to get the information we want out of the div tags with class sammy.

tmp_one=soup.find_all('div','sammy')[0]
type(tmp_one)

bs4.element.Tag

Each element returned by find_all is a bs4.element.Tag, so the search methods (find, find_all) can be called on it again.

tmp_one.find(class_='sammyRank')

<div class="sammyRank">1</div>

So calling find once more, this time for the sammyRank class, returns the rank element.

# Take only the text.
tmp_one.find(class_='sammyRank').get_text()

'1'

# Get the menu name and the shop name as well.
tmp_one.find(class_='sammyListing').get_text()

'BLT\r\nOld Oak Tap\nRead more '

The shop name and the menu name come out together in one string, but we have them.

# Also keep the href of the a tag, which is the address the link leads to.
tmp_one.find('a')['href']

'/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'

The menu and shop name are not cleanly separated, so split them with a regular expression

import re

tmp_string = tmp_one.find(class_='sammyListing').get_text()

# Split on \n or \r\n.
re.split(('\n|\r\n'), tmp_string)      

print(re.split(('\n|\r\n'), tmp_string)[0])
print(re.split(('\n|\r\n'), tmp_string)[1])

BLT
Old Oak Tap
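
The pattern '\n|\r\n' works here, but the same split can be written more compactly. As a small sketch, the pattern \r?\n treats the carriage return as optional and splits only once. The input string is the get_text() result shown earlier.

```python
import re

tmp_string = "BLT\r\nOld Oak Tap\nRead more "

# \r?\n matches both Windows (\r\n) and Unix (\n) line endings.
parts = re.split(r"\r?\n", tmp_string)
menu, cafe = parts[0], parts[1]
print(menu, "|", cafe)
```

Storing the split result in parts avoids running the same regular expression twice for the two fields.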

from urllib.parse import urljoin

# Append the information for every restaurant to the lists.
rank=[]
cafe_name=[]
main_menu=[]
url_add=[]

list_soup = soup.find_all('div','sammy')

for item in list_soup:
    rank.append(item.find(class_='sammyRank').get_text())
    
    tmp_string = item.find(class_='sammyListing').get_text()
    
    cafe_name.append(re.split(('\n|\r\n'), tmp_string)[1])
    main_menu.append(re.split(('\n|\r\n'), tmp_string)[0])
                     
    url_add.append(urljoin(url_base, item.find('a')['href']))

# Check that the ranks were collected.
rank[:5]

['1', '2', '3', '4', '5']

# Check that the shop names were added.
cafe_name[:5]

['Old Oak Tap', 'Au Cheval', 'Xoco', 'Al’s Deli', 'Publican Quality Meats']

# And the main menu items.
main_menu[:5]

['BLT', 'Fried Bologna', 'Woodland Mushroom', 'Roast Beef', 'PB&L']

# Check that the URLs are in place too.
url_add[:5]

['https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/',
 'https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Au-Cheval-Fried-Bologna/',
 'https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Xoco-Woodland-Mushroom/',
 'https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Als-Deli-Roast-Beef/',
 'https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Publican-Quality-Meats-PB-L/']

# Confirm that every entry was added.
len(rank),len(cafe_name),len(main_menu),len(url_add)

(50, 50, 50, 50)
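
The loop above keeps four parallel lists in step. One possible variation, sketched here under assumptions, gathers each entry into a single dict through a helper function. parse_sammy is a hypothetical name introduced for this example, and the HTML is one entry copied from the output shown earlier, so no network access is needed.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

url_base = "https://www.chicagomag.com"
sample_html = """<div class="sammy">
<div class="sammyRank">2</div>
<div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Au-Cheval-Fried-Bologna/"><b>Fried Bologna</b><br/>
Au Cheval<br/>
<em>Read more</em> </a></div>
</div>"""


def parse_sammy(item):
    """Return one ranking entry as a dict (hypothetical helper)."""
    listing = item.find(class_="sammyListing").get_text()
    menu, cafe = re.split(r"\r?\n", listing)[:2]
    return {
        "Rank": int(item.find(class_="sammyRank").get_text()),
        "Cafe": cafe.strip(),
        "Menu": menu.strip(),
        "URL": urljoin(url_base, item.find("a")["href"]),
    }


soup = BeautifulSoup(sample_html, "html.parser")
rows = [parse_sammy(item) for item in soup.find_all("div", "sammy")]
print(rows[0])
```

A list of dicts like rows can be passed straight to pd.DataFrame, and each entry stays together if one field fails to parse.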

# Build a DataFrame with pandas.
import pandas as pd

data = {'Rank':rank, 'Cafe':cafe_name, 'Menu':main_menu, 'URL':url_add}
df = pd.DataFrame(data)
df.head()

# If the column order is not what you want, set it explicitly.
df = pd.DataFrame(data, columns=['Rank', 'Cafe', 'Menu', 'URL'])
df.head(5)

# Save the result to a CSV file.
df.to_csv('pydata/03. best_sandwiches_list_chicago.csv', sep=',',
         encoding='UTF-8')
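
By default to_csv also writes the row index as an unnamed first column. If that column is not wanted, index=False leaves it out. The sketch below shows this with a two-row stand-in for the real data and an in-memory buffer instead of a file, both assumptions made so the example runs on its own. It also converts the scraped rank strings to integers.

```python
import io

import pandas as pd

data = {
    "Rank": ["1", "2"],
    "Cafe": ["Old Oak Tap", "Au Cheval"],
    "Menu": ["BLT", "Fried Bologna"],
}
df = pd.DataFrame(data, columns=["Rank", "Cafe", "Menu"])

# The scraped ranks are strings; convert them so they sort numerically.
df["Rank"] = df["Rank"].astype(int)

# index=False keeps the row index out of the output.
buffer = io.StringIO()
df.to_csv(buffer, sep=",", index=False)

# Reading the text back gives the same three columns and no extra index column.
restored = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(restored.columns))
```

If the index column is kept, as in the cell above, passing index_col=0 to read_csv restores it when the file is loaded again.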


조환희's study blog

Analyzing Chicago's Best Sandwich Shops - 1