I've been following a Python tutorial on YouTube and got to the part where we build a basic web crawler. I'm trying to make a very simple one myself: go to the cars section of Craigslist for my city, print the title/link of every entry, then jump to the next page and repeat as needed. It works for the first page, but it never moves on to the next page to grab more data. Can someone help explain what's wrong? Simple Python web crawler

import requests
from bs4 import BeautifulSoup

def widow(max_pages):
    page = 0  # craigslist starts at page 0
    while page <= max_pages:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page)  # craigslist search url + current page number
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'lxml')  # my computer yelled at me if 'lxml' wasn't included. your mileage may vary
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            href = 'http://orlando.craigslist.org' + link.get('href')  # href = /cto/'number'.html
            title = link.string
            print(title)
            print(href)
            page += 100  # craigslist pages go 0, 100, 200, etc

widow(0)  # 0 gets the first page, replace with multiples of 100 for extra pages

Answer


It looks like you have an indentation problem: page += 100 needs to sit inside the while block, not inside the for loop.

def widow(max_pages):
    page = 0  # craigslist starts at page 0
    while page <= max_pages:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page)  # craigslist search url + current page number
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'lxml')  # my computer yelled at me if 'lxml' wasn't included. your mileage may vary
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            href = 'http://orlando.craigslist.org' + link.get('href')  # href = /cto/'number'.html
            title = link.string
            print(title)
            print(href)
        page += 100  # craigslist pages go 0, 100, 200, etc
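
As a side note (not part of the original answer), this kind of indentation slip can also be avoided by making the page stepping explicit with a range-based loop. A minimal sketch, assuming the same Orlando search URL and 'hdrlnk' link class used above:

import requests
from bs4 import BeautifulSoup

def widow(max_pages):
    # visit result pages at offsets 0, 100, 200, ... up to max_pages
    for page in range(0, max_pages + 1, 100):
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page)
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            print(link.string)                                          # listing title
            print('http://orlando.craigslist.org' + link.get('href'))   # absolute link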

Holy crap, wow. I feel so dumb right now, haha. Thanks. – v0dkuh


Isn't this only part of the solution? 'page' gets incremented, but in the example 'max_pages' is set to '0'. After the first page, '100 <= 0' returns False and the loop therefore exits. –


The OP's comment suggests he would call widow(0) just to get the first page. If he calls widow(1000), he will keep scraping until page <= 1000. – sisanared
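
To illustrate the point about max_pages (hypothetical calls, using the corrected function from the answer):

widow(0)    # 0 <= 0: fetches only the first page (s=0), then 100 <= 0 fails
widow(300)  # fetches s=0, s=100, s=200 and s=300 before 400 <= 300 ends the loop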