2016-12-06

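Loop for scraping multiple web pages does not advance to the next page

I am trying to scrape the top 500 restaurants in Singapore from TripAdvisor. However, my current code only pulls the first 30 entries and keeps looping, printing those same 30 again and again until it reaches 500 records. I would like it to print the first 30, then the 30 from the next page, and so on. Could someone take a look at my code and tell me why it behaves this way?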

#loop to move into the next pages. entries are in increments of 30 per page 
for i in range(0, 500, 30): 
    #url format offsets the restaurants in increments of 30 after the oa 
    #change key and geography here 
    url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa' + str(i) + 'Singapore.html#EATERY_LIST_CONTENTS' 
    r1 = requests.get(url1) 
    data1 = r1.text 
    soup1 = BeautifulSoup(data1, "html.parser") 
    for link in soup1.findAll('a', {'property_title'}): 
     #change key here 
     restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href') 
     print restaurant_url 

Answer


I think you are building the URL incorrectly here:

url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa' + str(i) + 'Singapore.html#EATERY_LIST_CONTENTS' 

The correct URL format should be:

url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html#EATERY_LIST_CONTENTS'.format(i) 

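Note the dash after the "page offset". Without it, the offset runs straight into "Singapore" (for example "oa30Singapore"), so the URL no longer matches the site's paging format.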
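To make the difference concrete, here is a minimal sketch that only builds the page URLs and makes no network requests. The names `PAGE_URL`, `build_page_url`, `offsets` and `page_urls` are hypothetical and chosen for illustration; only the URL pattern itself is taken from the answer.

```python
# Hypothetical helper for illustration; only the URL pattern comes from the answer.
PAGE_URL = ('https://www.tripadvisor.com/Restaurants-g294265-oa{0}'
            '-Singapore.html#EATERY_LIST_CONTENTS')


def build_page_url(offset):
    """Return the listing URL for a given result offset (0, 30, 60, ...)."""
    return PAGE_URL.format(offset)


# Offsets step by 30 because each listing page holds 30 entries.
offsets = list(range(0, 500, 30))
page_urls = [build_page_url(offset) for offset in offsets]

print(len(page_urls))  # number of listing pages that would be requested
print(page_urls[1])    # URL of the second page (offset 30)
```

Printing the generated URLs before requesting them is a cheap way to spot a missing separator like the one in the question.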


I would also maintain a web-scraping session and improve the variable naming:

import requests 
from bs4 import BeautifulSoup 


with requests.Session() as session:
    for offset in range(0, 500, 30):
        url = 'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html#EATERY_LIST_CONTENTS'.format(offset)

        soup = BeautifulSoup(session.get(url).content, "html.parser")
        for link in soup.select('a.property_title'):
            restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href')
            print(restaurant_url)

Also, consider adding a delay between subsequent requests to be a better web-scraping citizen.
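One possible way to add such a delay is sketched below. The helper `fetch_politely`, its parameters and the 2-second default are assumptions made for illustration, not something taken from the answer or from any library. The fetching function is passed in, so the sketch itself needs only the standard library.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page content.
    `sleep` is injectable so the pause can be replaced, e.g. in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With the session from the code above, `fetch` could be `lambda url: session.get(url).content`. A suitable delay depends on the site, so treat the default here as a placeholder.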


Works like a charm. And thank you for linking that session article. I am new to this community, so anything helps! – dtrinh