Scraping a list of URLs

I'm using Python 3.5 and trying to scrape a list of URLs (all from the same website). My code is as follows:

import urllib.request
from bs4 import BeautifulSoup

url_list = ['URL1',
            'URL2', 'URL3']

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker

# Scraping
def getPropNames():
    for propName in soup.findAll('div', class_="property-cta"):
        for h1 in propName.findAll('h1'):
            print(h1.text)

def getPrice():
    for price in soup.findAll('p', class_="room-price"):
        print(price.text)

def getRoom():
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)

for soups in soup():
    getPropNames()
    getPrice()
    getRoom()

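So far, if I print soup, or call getPropNames, getPrice or getRoom on their own, they seem to work. But I can't get the script to go through each of the URLs and print getPropNames, getPrice and getRoom for every one.

I've only been learning Python for a few months, so I would greatly appreciate any help!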

2017-02-17 Maverick

Think about what this code does:

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker

Let me show you an example:

def soup2():
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            return maker

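With url_list = ['one', 'two', 'three'], calling soup2() prints:

one
one a

Do you see now what is going on?

Basically, your soup function exits at the first return. It does not return an iterator or a list, only the first BeautifulSoup object, and you are lucky (or not) that this object happens to be iterable :)

So change the code: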

def soup3():
    soups = []
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            soups.append(maker)
    return soups

The output is then:

one
one a
one b
one c
two
two a
two b
two c
three
three a
three b
three c

But I believe this won't work either :) Just check what sauce = urllib.request.urlopen(url) actually returns and what your code iterates over in for things in sauce, that is, what things is.
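To make that question concrete without touching the network, here is a minimal sketch. It assumes io.BytesIO is an acceptable stand-in for the response object: both are binary file-like objects, and iterating over such an object yields one line of bytes at a time. The HTML snippet and the names fake_response, chunks and whole_page are made up for illustration.

```python
import io

# Hypothetical stand-in for the object returned by urllib.request.urlopen(url).
fake_response = io.BytesIO(
    b"<div class='property-cta'>\n"
    b"  <h1>Some property</h1>\n"
    b"</div>\n"
)

# This is what `for things in sauce` walks over: single lines, not whole pages.
chunks = [things for things in fake_response]

# Reading the object instead returns the complete document in one piece.
fake_response.seek(0)
whole_page = fake_response.read()

print(len(chunks))                        # 3
print(whole_page == b"".join(chunks))     # True
```

So in the original loop each things is probably just one line of the page, which suggests passing the whole body (for example the result of sauce.read()) to BeautifulSoup rather than parsing line by line.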

Happy coding.

2017-02-17 13:44:33 opalczynski

Thanks Sebastian Opałczyński, I'll take that on board, try to get my head around it, and let you know the outcome! – Maverick

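Each of the get* functions uses the global variable soup, which is not set correctly anywhere. Even if it were, this would not be a good approach. Make soup a function argument instead, e.g.: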

def getRoom(soup):
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)

Secondly, you should use yield instead of return in soup() to turn it into a generator. Otherwise, you would need to return a list of BeautifulSoup objects.

def soups():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            yield soup_maker
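For comparison, here is a toy sketch of the three options (return inside the loop, yield, and returning a list). It uses plain strings in place of BeautifulSoup objects so it runs without any network access; the function names first_only, as_generator and as_list are made up for illustration.

```python
url_list = ['one', 'two', 'three']

def first_only():
    for url in url_list:
        return url.upper()      # leaves the function during the first pass

def as_generator():
    for url in url_list:
        yield url.upper()       # hands back one value, resumes on the next request

def as_list():
    return [url.upper() for url in url_list]   # builds everything up front

print(first_only())             # ONE
print(list(as_generator()))     # ['ONE', 'TWO', 'THREE']
print(as_list())                # ['ONE', 'TWO', 'THREE']
```

The generator and the list give the caller the same values. The generator produces them one at a time as the for loop asks for them, which can matter when each item is an entire parsed page.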

I would also suggest using XPath or CSS selectors to extract HTML elements: https://stackoverflow.com/a/11466033/2997179.

2017-02-17 13:47:03

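Thanks Martin Valgur, this is very insightful - I'll look into XPath/CSS. When applying your suggestions, I get the following error message: AttributeError: 'function' object has no attribute 'findAll' - any ideas? – Maverick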

Did you add the 'soup' parameter to all of the functions? I would also suggest renaming the 'soup()' function to 'soups()'. –
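That error message is what you would expect if one of the get* functions still has no soup parameter while a function named soup exists at module level. A minimal sketch that reproduces it, using a stub soup() and a hypothetical get_price() in place of the real code:

```python
def soup():
    return "parsed page"

def get_price():
    # No parameter and no local variable called soup, so Python looks up the
    # global name soup, which is the function defined above.
    return soup.findAll('p')

try:
    get_price()
except AttributeError as exc:
    message = str(exc)
    print(message)
```

Giving every get* function its own soup argument, and renaming the generator to soups(), removes the name clash.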

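Thanks, that was my mistake! However, it only seems to work for getPrice. The other two don't return anything? Strange, because when I first wrote these functions I used a single URL and they all worked perfectly. – Maverick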
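One plausible explanation, offered as a guess because the page HTML is not shown in this thread: the loop for things in sauce parses each line of the response separately, so a lookup that needs a div and an h1 or h5 nested inside it fails whenever the two tags sit on different lines, while a self-contained p element on a single line is still found. The sketch below illustrates the effect with the standard library's html.parser as a stand-in for BeautifulSoup; the Collector class, the scrape function and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser

class Collector(HTMLParser):
    """Records h1 text inside div.property-cta and the text of p.room-price."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, class) for every currently open element
        self.names = []
        self.prices = []

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get('class')))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        if self.stack[-1] == ('p', 'room-price'):
            self.prices.append(text)
        elif self.stack[-1][0] == 'h1' and ('div', 'property-cta') in self.stack:
            self.names.append(text)

def scrape(fragments):
    """Parse every fragment with a fresh parser, as the line-by-line loop does."""
    names, prices = [], []
    for fragment in fragments:
        parser = Collector()
        parser.feed(fragment)
        parser.close()
        names += parser.names
        prices += parser.prices
    return names, prices

page = (
    '<div class="property-cta">\n'
    '  <h1>Some property</h1>\n'
    '</div>\n'
    '<p class="room-price">100</p>\n'
)

print(scrape(page.splitlines(keepends=True)))   # line by line: the name is lost
print(scrape([page]))                           # whole page: name and price found
```

If this is the cause, dropping the inner for things in sauce loop and building one BeautifulSoup object per URL from the full response, for example BeautifulSoup(sauce.read(), 'html.parser'), should let getPropNames and getRoom find their elements again.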
