내일배움캠프 (Naeil Baeum Camp) 210915: TIL (Today I Learned)

1. Web scraping notes

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('url', headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')

title = soup.select_one('copy selector')  # use select_one when you want a single element

print(title)

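headers is passed here because many sites block plain requests sent from code.
Sending a browser User-Agent makes the request look as if it came from typing the address into a browser and pressing Enter.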
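
The difference can be seen without sending anything over the network. The following is a minimal sketch, assuming the `requests` library is installed; `https://example.com` is only a placeholder address, not a site from these notes. It compares the User-Agent that `requests` uses by default with the one set through `headers`.

```python
import requests

# Browser-like User-Agent, the same idea as in the notes above
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}

# What requests identifies itself as when no headers are given
default_ua = requests.utils.default_user_agent()

# Build the request without sending it (the URL is a placeholder)
prepared = requests.Request('GET', 'https://example.com', headers=headers).prepare()
sent_ua = prepared.headers['User-Agent']

print('default:', default_ua)
print('custom :', sent_ua)
```

The default value starts with `python-requests/`, which a site can easily recognize as a script and reject.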

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('url', headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')

trs = soup.select('copy selector')

for tr in trs:
    print(tr)

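This prints every matched row, one after another.

Next, replace print(tr) with

    a = tr.select_one('copy selector')
    print(a)

and the part you are looking for is printed for each row. Rows that do not contain it print None.

Then replace print(a) with

    if a is not None:
        title = a.text
        print(title)

This prints every value except the None ones.

Finally, replace print(title) with

        doc = {
            'title': title
        }
        db.users.insert_one(doc)

and each value is saved to MongoDB (db is the pymongo database object, set up as in section 3).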
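
The steps above can be tried end to end without a network connection. This is a rough sketch, assuming `bs4` is installed; the HTML string and the movie titles are made up for illustration, and the documents are collected in a list instead of being written to MongoDB.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for data.text from a real page
html = """
<table>
  <tr><td class="title"><a href="/movie/1">Movie A</a></td></tr>
  <tr><td class="blank"></td></tr>
  <tr><td class="title"><a href="/movie/2">Movie B</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# select returns a list of every matching element
trs = soup.select('table tr')

docs = []
for tr in trs:
    # select_one returns the first match, or None when the row has no match
    a = tr.select_one('td.title > a')
    if a is not None:
        title = a.text
        # With a database available, this is where db.users.insert_one(doc) would go
        docs.append({'title': title})

print(docs)
```

The middle row has no link, so `select_one` returns None for it and the `if` check skips it.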

2. Crawling with meta tags

import requests
from bs4 import BeautifulSoup

url = ''

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get(url, headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')

title = soup.select_one('meta[property="og:title"]')['content']

print(title)
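
Not every page has every og tag, and indexing ['content'] on a missing tag raises a TypeError because select_one returned None. The sketch below, assuming `bs4` is installed, wraps the lookup in a small helper. The helper name `og` and the sample HTML are hypothetical, not part of the course code.

```python
from bs4 import BeautifulSoup

# Made-up page head for illustration
html = """
<html><head>
<meta property="og:title" content="Sample Movie">
<meta property="og:image" content="https://example.com/poster.jpg">
</head><body></body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

def og(soup, name):
    """Return the content of the og:<name> meta tag, or None if the tag is missing."""
    tag = soup.select_one(f'meta[property="og:{name}"]')
    return tag['content'] if tag is not None else None

title = og(soup, 'title')
image = og(soup, 'image')
desc = og(soup, 'description')  # this tag is not in the sample HTML

print(title, image, desc)
```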

3. Fetching data from MongoDB

from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client.dbsparta

user = db.users.find_one({'name':'bobby'})

print(user)
This returns the document whose name is 'bobby' as a dictionary. Adding ['key'] after it, as in user['star'], prints only that field.

from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client.dbsparta

user = db.users.find_one({'name':'bobby'})
a = user['star']

b = list(db.users.find({'star':a},{'_id':False}))

print(b)
This finds every document whose star value matches a.
Printing the list puts everything on a single line.
So replace print(b) with

for c in b:
    print(c)

and each document is printed on its own line.
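
To see what the two queries do without a running MongoDB server, here is a plain-Python imitation. It is only a stand-in: the functions `find_one` and `find` below are hypothetical helpers over a list of dictionaries, not pymongo calls, and the sample users are invented.

```python
# Invented sample data standing in for the users collection
users = [
    {'_id': 1, 'name': 'bobby', 'star': 5},
    {'_id': 2, 'name': 'john', 'star': 3},
    {'_id': 3, 'name': 'jane', 'star': 5},
]

def matches(doc, query):
    """True when every key in the query has the same value in the document."""
    return all(doc.get(key) == value for key, value in query.items())

def find_one(docs, query):
    """Imitates find_one: the first matching document, or None."""
    for doc in docs:
        if matches(doc, query):
            return doc
    return None

def find(docs, query):
    """Imitates find with the projection {'_id': False}: all matches, without _id."""
    return [
        {key: value for key, value in doc.items() if key != '_id'}
        for doc in docs
        if matches(doc, query)
    ]

user = find_one(users, {'name': 'bobby'})
a = user['star']

b = find(users, {'star': a})

print(b)      # the whole list on one line
for c in b:   # one document per line
    print(c)
```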

4. Today's errors and fixes

werkzeug.exceptions.BadRequestKeyError: 400 Bad Request: The browser (or proxy) sent a request that this server could not understand. -> I looked for a problem in the code but found none. After running the app again and refreshing the local page, it worked normally. After editing, I should rerun the app and refresh the page before testing.
TypeError: 'Collection' object is not callable. If you meant to call the 'mystar_one' method on a 'Collection' object it is failing because no such method exists. -> A call error. Fixed after checking the collection name and the find call.
TypeError: The view function for 'delete_star' did not return a valid response. The function either returned None or ended without a return statement. -> Fixed after checking the return statement.
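
For reference, the first and third errors can be reproduced in a small Flask app. This is a sketch under assumptions, not the actual project code: the route `/api/delete` and the form key `name_give` are hypothetical, and only the view name `delete_star` comes from the error message above. It uses Flask's test client, so no server has to be running.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/api/delete', methods=['POST'])
def delete_star():
    # request.form['...'] raises BadRequestKeyError (HTTP 400)
    # when the client did not send that key ('name_give' is a made-up key)
    name = request.form['name_give']
    # A view must return a response. Without this return the function
    # returns None and Flask raises the "did not return a valid response" TypeError.
    return jsonify({'msg': f'{name} deleted'})

client = app.test_client()

missing = client.post('/api/delete', data={})                   # key not sent
ok = client.post('/api/delete', data={'name_give': 'bobby'})    # key sent

print(missing.status_code)
print(ok.get_json())
```

A stale page that still sends the old form keys would hit the same 400, which may be why rerunning and refreshing made the error go away.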

5. Roles of GET and POST

GET
Use request.args.get('key') to read a value directly from the URL query string.

POST
Use request.form['key'] to read a value sent in the form body.
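
The two styles can be compared side by side. Below is a small sketch, assuming Flask is installed; the routes `/search` and `/save` and their view names are made up for illustration. The test client stands in for the browser.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/search', methods=['GET'])
def search():
    # GET: the value arrives in the URL, e.g. /search?key=hello
    keyword = request.args.get('key')  # None when the parameter is absent
    return jsonify({'received': keyword})

@app.route('/save', methods=['POST'])
def save():
    # POST: the value arrives in the request body as form data
    value = request.form['key']
    return jsonify({'received': value})

client = app.test_client()

get_result = client.get('/search?key=hello').get_json()
post_result = client.post('/save', data={'key': 'world'}).get_json()
no_param = client.get('/search').get_json()

print(get_result, post_result, no_param)
```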

6. A crawling question

In the movie star crawling code:
    urls = []
    for tr in trs:
        a = tr.select_one('td.title > a')
        if a is not None:
            base_url = 'address'  # placeholder for the site's base URL
            url = base_url + a['href']
            urls.append(url)

    return urls

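What does urls do here?

Think of it as a list that gathers the url values to be passed to insert_star(url).
It is built in advance so that it is easy to use later.

We need to crawl and save 10 links. Building each url at the moment it is needed would make the code more complicated
and messy, so the urls are listed first and the remaining work is done afterwards.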
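
The two-step idea, collect first and process afterwards, might look like the sketch below. It assumes `bs4` is installed. The function name `get_urls`, the sample HTML, and `https://example.com` are placeholders, and `insert_star` here is only a stub that returns a dictionary instead of crawling the page and saving to the database.

```python
from bs4 import BeautifulSoup

def get_urls(html, base_url):
    """Step 1: collect the detail-page URLs from the list page."""
    soup = BeautifulSoup(html, 'html.parser')
    trs = soup.select('tr')

    urls = []
    for tr in trs:
        a = tr.select_one('td.title > a')
        if a is not None:
            urls.append(base_url + a['href'])
    return urls

def insert_star(url):
    """Step 2 (stub): the real version would crawl the url and save the result."""
    return {'url': url}

# Made-up list page for illustration
html = """
<table>
  <tr><td class="title"><a href="/star/1">Star A</a></td></tr>
  <tr><td class="etc">no link here</td></tr>
  <tr><td class="title"><a href="/star/2">Star B</a></td></tr>
</table>
"""

urls = get_urls(html, 'https://example.com')

# Each collected url is handled in a separate, simple loop
stars = [insert_star(url) for url in urls]

print(urls)
```

Keeping the URL building inside get_urls means the second loop only has to deal with one finished url at a time.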