Loop for scraping multiple web pages is not looping

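I am currently trying to scrape the top 500 restaurants in Singapore on TripAdvisor; however, my current code only pulls the first 30 and keeps looping, printing those same 30 over and over until it reaches 500 records. I would like it to print the first 30, then the next 30 from the following page, and so on. I was wondering if anyone could take a look at my code and see why it is doing this.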

import requests
from bs4 import BeautifulSoup

# loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 500, 30):
    # url format offsets the restaurants in increments of 30 after the oa
    # change key and geography here
    url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa' + str(i) + 'Singapore.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        # change key here
        restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href')
        print(restaurant_url)

2016-12-06 dtrinh

I think you are building the URL incorrectly here:

url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa' + str(i) + 'Singapore.html#EATERY_LIST_CONTENTS'

The correct URL format should be:

url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html#EATERY_LIST_CONTENTS'.format(i)

Note the dash after the page offset.
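To check the generated URLs without making any requests, a minimal sketch such as the following can help. The names `LISTING_URL` and `page_url` are hypothetical helpers introduced here for illustration, and the URL pattern is simply the one quoted in this answer; whether the site still serves pages under that scheme is an assumption.

```python
# Hypothetical helper names; the URL pattern is taken from the answer above.
LISTING_URL = (
    'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html'
    '#EATERY_LIST_CONTENTS'
)


def page_url(offset):
    """Return the listing URL for the given result offset (0, 30, 60, ...)."""
    return LISTING_URL.format(offset)


# One URL per page of 30 entries, covering offsets 0 through 480.
urls = [page_url(offset) for offset in range(0, 500, 30)]

for url in urls[:3]:
    print(url)
```

Printing the first few URLs makes it easy to confirm that the offset is followed by a dash before `Singapore.html`, which is the part missing from the string concatenation in the question.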

I would also maintain a web-scraping session and improve the variable naming:

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    for offset in range(0, 500, 30):
        url = 'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html#EATERY_LIST_CONTENTS'.format(offset)

        soup = BeautifulSoup(session.get(url).content, "html.parser")
        for link in soup.select('a.property_title'):
            restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href')
            print(restaurant_url)

Also, consider adding delays between subsequent requests to be a better web-scraping citizen.
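One possible way to add such a delay is sketched below. The function `fetch_with_delay`, its parameters, and the two-second default are assumptions made for illustration, not something the answer prescribes; the fetching function is passed in so the pacing logic can be tried without touching the network.

```python
import time


def fetch_with_delay(fetch, urls, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch is any callable that takes a URL and returns the page content.
    sleep is injectable so the pacing can be checked without really waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Inside the `with requests.Session() as session:` block, passing `lambda url: session.get(url).content` as `fetch` would download the pages one by one with a pause between them. A suitable delay depends on the site, so the default here is only a placeholder.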

2016-12-06 17:15:15 alecxe

Works like a charm. And thanks for linking that session article. I'm new to this community, so anything helps! – dtrinh
