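Python Beautiful Soup web scraping: return only new data?
I am new to web scraping in Python, but the skill I ultimately want to develop is storing scraped data in a database and refreshing that data periodically.
My question is: how can I save on data requests (time, bandwidth usage) by requesting only the data that has been added since the script last ran?
For example, my code returns the car listings on the Autotrader website: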
from bs4 import BeautifulSoup
import requests
# URL and headers so the site thinks we are a browser
url = "https://www.autotrader.co.uk/car-search?search-target=usedcars&is-quick-search=true&radius=&onesearchad=used&onesearchad=nearlynew&onesearchad=new&make=AC&model=&price-from=&price-to=&postcode=sw65bg"
headers = {'User-Agent': 'Mozilla/5.0'}
# Request the page; headers must be passed by keyword,
# otherwise requests treats the dict as query parameters
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
# Find the listing title elements
name_box = soup.find_all('h2', attrs={'class': 'listing-title'})
# Print the name_box results to see them
for listing in name_box:
    print(listing.text)
Instead of using a database, I can store the output in a DataFrame to help illustrate my problem:
import pandas as pd
# Collect the listing titles into a single-column DataFrame.
# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
data = pd.DataFrame({'A': [listing.text for listing in name_box]})
print(data)
Output:
A
0 AC Cobra 6.3 2dr
1 AC Cobra 4.9 MK IV 2dr
2 AC Cobra 3.5 2dr
3 AC Cobra 3.5 2dr
4 AC Cobra 5.3 2dr
5 AC Cobra 5.7
6 AC Cobra 4736 Built By Gardner Douglas 4.7 2dr
7 AC Cobra 5.7
8 AC Cobra 5.7 2dr
9 AC Cobra 5.8
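If a new AC Cobra listing (row 10) appears on the website, is there a way to display or append only the new entry, so that I can identify new listings as they appear?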
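To make the goal concrete, here is a minimal sketch of the "store and compare" part using the standard-library sqlite3 module. It rests on assumptions that are not in my code above: `store_new_listings` is a hypothetical helper, and `listing_id` is a hypothetical unique key for each advert (for example its URL), which I would still need to scrape. Titles alone are not enough, since "AC Cobra 5.7" already appears more than once in the output. This only identifies new rows after the page has been fetched; it does not by itself reduce the amount of data requested.

```python
import sqlite3


def store_new_listings(conn, listings):
    """Insert listings not seen before and return the ones that were new.

    `listings` is an iterable of (listing_id, title) pairs. `listing_id` is
    an assumed unique key per advert (hypothetical, e.g. the advert URL).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "listing_id TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )
    new_rows = []
    for listing_id, title in listings:
        # INSERT OR IGNORE skips rows whose primary key already exists
        cur = conn.execute(
            "INSERT OR IGNORE INTO listings (listing_id, title) VALUES (?, ?)",
            (listing_id, title),
        )
        # rowcount is 1 when the row was inserted, 0 when it was ignored
        if cur.rowcount == 1:
            new_rows.append((listing_id, title))
    conn.commit()
    return new_rows


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
    # First run: every listing is new
    store_new_listings(conn, [("ad-1", "AC Cobra 5.7"), ("ad-2", "AC Cobra 5.7")])
    # Second run: only the listing with an unseen key is reported
    print(store_new_listings(conn, [("ad-2", "AC Cobra 5.7"), ("ad-3", "AC Cobra 5.8")]))
```

With a file-backed database instead of `:memory:`, each scheduled run would report only the listings whose key was not stored by an earlier run.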