
I want to scrape Yellow Pages using requests. I know a login isn't required to get the data on these pages, but I just want the practice of logging in to a site. Is there a simpler way to scrape multiple web pages using a dictionary?

Is there a way to fetch multiple URLs at once with `s.get()`? Below is my current code layout, but it seems like there should be a simpler approach so that I don't have to write five more lines of code every time I add a new page.

This code works for me, but it seems too long.

import requests 
from bs4 import BeautifulSoup 
import requests.cookies 

s = requests.Session() 

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'} 

url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed&register=true" 
page = s.get(url)  # fetch the login page once
soup = BeautifulSoup(page.content, "lxml") 
soup.prettify() 

csrf = soup.find("input", value=True)["value"] 

USERNAME = 'myusername' 
PASSWORD = 'mypassword' 

cj = s.cookies 
requests.utils.dict_from_cookiejar(cj) 

login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf) 
s.post(url, data=login_data, headers={'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"}) 

targeted_page = s.get('http://m.yp.com/search?search_term=restaurants&search_type=category', cookies=cj) 

targeted_soup = BeautifulSoup(targeted_page.content, "lxml") 

targeted_soup.prettify() 

for record in targeted_soup.findAll('div'): 
    print(record.text) 

targeted_page_2 = s.get('http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA', cookies=cj) 

targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml") 

targeted_soup_2.prettify() 

for data in targeted_soup_2.findAll('div'): 
    print(data.text) 

When I try to use a dictionary like this, I get a traceback that I don't understand.

import requests 
from bs4 import BeautifulSoup 
import requests.cookies 

s = requests.Session() 

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'} 

url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed&register=true" 
page = s.get(url)  # fetch the login page once
soup = BeautifulSoup(page.content, "lxml") 
soup.prettify() 

csrf = soup.find("input", value=True)["value"] 

USERNAME = 'myusername' 
PASSWORD = 'mypassword' 

login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf) 
s.post(url, data=login_data, headers={'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"}) 

targeted_pages = {'http://m.yp.com/search?search_term=restaurants&search_type=category', 
        'http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA' 
        } 
targeted_page = s.get(targeted_pages) 

targeted_soup = BeautifulSoup(targeted_page.content, "lxml") 

targeted_soup.prettify() 

for record in targeted_soup.findAll('div'): 
    print(record.text) 

targeted_page_2 = s.get('http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA') 

targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml") 

targeted_soup_2.prettify() 

The error:

raise InvalidSchema("No connection adapters were found for '%s'" % url) 
requests.exceptions.InvalidSchema: No connection adapters were found for '{'http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA', 'http://m.yp.com/search?search_term=restaurants&search_type=category'}' 

I'm new to Python and the requests module, and I don't understand why using a dictionary in this format doesn't work. Thanks for any input.

Answer


First of all, what you have there is a set, not a dictionary. If you want to request each URL, you need to iterate over it; `requests.get` expects a single URL as its first argument, not a set or any other iterable of URLs:

targeted_pages = {'http://m.yp.com/search?search_term=restaurants&search_type=category', 
        'http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA' 
        } 
for target in targeted_pages: 
    targeted_page = s.get(target) 
    targeted_soup = BeautifulSoup(targeted_page.content, "lxml") 
    for record in targeted_soup.findAll('div'): 
        print(record.text)
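The loop above can also be folded into a small helper so each new page is one more entry in the iterable rather than five more lines. This is a sketch, not part of the accepted answer: the function name `scrape_divs` and the returned dict shape are my own, and it passes the headers with every request so the User-Agent is actually used.

```python
import requests  # session handling, as in the thread
from bs4 import BeautifulSoup

def scrape_divs(session, urls, headers=None):
    """Fetch each URL with the session and return {url: [text of each <div>]}.

    `urls` can be any iterable (set, list, dict keys); each element is
    requested individually, since requests.get takes one URL at a time.
    """
    results = {}
    for url in urls:
        page = session.get(url, headers=headers)
        soup = BeautifulSoup(page.content, "html.parser")
        results[url] = [div.get_text() for div in soup.find_all("div")]
    return results
```

With the session from the question, `scrape_divs(s, targeted_pages, headers=headers)` would return one list of div texts per URL.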

@user6326823, no problem; just pass the headers with each request so that the User-Agent and so on are actually used. –


Oh, cool! Thanks for the help. I have one more question: it seems like I'm logging in and out every time I hit a new link, because when I print(s.cookies) it prints two identical cookie lines. Does that mean it's logging me in and out? I just want to make sure I don't stand out like a bot. I'll accept this answer since it works perfectly. – user6326823
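One way to check is to look at the jar as a plain dict, using the same `requests.utils.dict_from_cookiejar` call from the question; repeated lines in `print(s.cookies)` often just mean the server re-sent the same cookie, not that you were logged out. A minimal sketch with a hypothetical cookie (the name `session_id` and domain are made up):

```python
import requests

s = requests.Session()
# Stand-in for a cookie the login response would normally set.
s.cookies.set("session_id", "abc123", domain="example.com")

# Flatten the jar into a readable name -> value dict for inspection.
cookie_dict = requests.utils.dict_from_cookiejar(s.cookies)
print(cookie_dict)
```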


Ah, duh, thank you. – user6326823
