How to scrape product details from an Amazon web page using BeautifulSoup

For the web page http://www.amazon.com/Harry-Potter-Prisoner-Azkaban-Rowling/dp/0439136369/ref=pd_sim_b_2?ie=UTF8&refRID=1MFBRAECGPMVZC5MJCWG how can I scrape the product details in Python and output them as a dict? In the case above, the dict output I would like to have is:

Age Range: 9 - 12 years 
Grade Level: 4 - 7 
... 
...

I am new to BeautifulSoup and have not found good examples for achieving this. I would appreciate some examples.

Source

2014-10-31 so3

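Have you made any attempt yourself? – 2014-10-31 20:28:44

What have you tried so far? – Hackaholic 2014-10-31 20:30:57

Take a look at 'mechanize' and 'BeautifulSoup', and see this answer for an example: http://stackoverflow.com/a/19284156/2327821 In general, you should do more legwork before asking such an open-ended question. – Michael 2014-10-31 20:35:41

The idea is to iterate over all the Product Details items with the help of the table#productDetailsTable div.content ul li CSS selector, then use the bold text as the key and its next sibling as the value: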

from pprint import pprint
from bs4 import BeautifulSoup
import requests

url = 'http://www.amazon.com/dp/0439136369'
# Send a browser-like User-agent header with the request.
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'})

# No parser is named here, so bs4 picks the best one that is installed.
soup = BeautifulSoup(response.content)
tags = {}
for li in soup.select('table#productDetailsTable div.content ul li'):
    try:
        title = li.b  # the bold label, or None if the item has none
        key = title.text.strip().rstrip(':')
        value = title.next_sibling.strip()

        tags[key] = value
    except AttributeError:
        # Stop at the first item that has no bold label.
        break

pprint(tags)

Prints:

{ 
    u'Age Range': u'9 - 12 years', 
    u'Amazon Best Sellers Rank': u'#1,440 in Books (', 
    u'Average Customer Review': u'', 
    u'Grade Level': u'4 - 7', 
    u'ISBN-10': u'0439136369', 
    u'ISBN-13': u'978-0439136365', 
    u'Language': u'English', 
    u'Lexile Measure': u'880L', 
    u'Mass Market Paperback': u'448 pages', 
    u'Product Dimensions': u'1.2 x 5.2 x 7.8 inches', 
    u'Publisher': u'Scholastic Paperbacks (September 11, 2001)', 
    u'Series': u'Harry Potter (Book 3)', 
    u'Shipping Weight': u'11.2 ounces (' 
}

Note that we break out of the loop as soon as we hit an AttributeError. This happens when an li element has no bold text inside it.
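The bold-label and next-sibling idea can be tried without any network access. The following is a minimal sketch, not the answerer's code: the SAMPLE_HTML markup and the parse_details name are hypothetical, written only to match the selector used above, and the real Amazon page may be structured differently. It also skips items without a bold label rather than stopping at the first one, which is a swapped-in design choice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selector from the answer;
# the real Amazon page may differ.
SAMPLE_HTML = """
<table id="productDetailsTable"><tr><td>
<div class="content"><ul>
<li><b>Age Range:</b> 9 - 12 years</li>
<li><b>Grade Level:</b> 4 - 7</li>
<li>No bold label here</li>
<li><b>Language:</b> English</li>
</ul></div>
</td></tr></table>
"""


def parse_details(html):
    """Map each bold label to the text node that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for li in soup.select("table#productDetailsTable div.content ul li"):
        label = li.b
        if label is None:
            # Skip items without a bold label instead of breaking.
            continue
        key = label.get_text().strip().rstrip(":")
        sibling = label.next_sibling
        # The sibling is a text node for plain values; anything else
        # (a nested tag, or nothing at all) becomes an empty string.
        value = sibling.strip() if isinstance(sibling, str) else ""
        details[key] = value
    return details


if __name__ == "__main__":
    print(parse_details(SAMPLE_HTML))
```

Skipping instead of breaking means a single unlabeled item in the middle of the list does not hide the entries that come after it.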

Source

2014-10-31 23:03:09 alecxe

Thank you for your answer. But why did you put the header information in requests.get? – so3 2014-11-02 17:47:53

@so3 It's just that I'm used to doing it that way :) – alecxe 2014-11-02 19:03:55

@alecxe Do you know why I only get {'Age Range': '9 - 12 years', 'Grade Level': '4 - 7'} when I pass the "html.parser" argument, as in soup = BeautifulSoup(response.content, "html.parser")? – multigoodverse 2015-12-20 09:50:52

from bs4 import BeautifulSoup
import urllib2  # Python 2 only

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
url = 'http://www.amazon.com/dp/0439136369'
# Pass the dict as request headers; passing it as data would send it as a POST body.
req = urllib2.Request(url, headers=headers)
soup = BeautifulSoup(urllib2.urlopen(req).read())
for x in soup.find_all('table', id='productDetailsTable'):
    for tag in x.find_all('li'):
        # Print the text of each list item.
        print(tag.get_text())

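With the code above you can extract the text in the table. I have not formatted the output or put it into a dictionary, since you said you only needed a little help. Here is what the code does. I had to change the user-agent, because Amazon does not allow the default Python user-agent. Using find_all, I find the table with id='productDetailsTable'. Then I loop over all the li tags, because all the information is stored in those tags.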
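The steps described above can also be sketched in Python 3 and carried through to a dictionary. This is an illustrative sketch under stated assumptions: build_request, details_from_text and SAMPLE_HTML are hypothetical names, the sample markup is invented to resemble the table, and splitting each item on its first colon is an assumption about how the entries are written. The request is only constructed, never sent.

```python
import urllib.request

from bs4 import BeautifulSoup

USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64)"

# Hypothetical markup; the real page may be structured differently.
SAMPLE_HTML = """
<table id="productDetailsTable"><tr><td><ul>
<li><b>ISBN-10:</b> 0439136369</li>
<li><b>Language:</b> English</li>
<li>No label on this item</li>
</ul></td></tr></table>
"""


def build_request(url):
    """Build a GET request that carries a browser-like User-agent."""
    # headers= sets real request headers; no data is attached,
    # so the request stays a GET.
    return urllib.request.Request(url, headers={"User-agent": USER_AGENT})


def details_from_text(html):
    """Split the text of each li on its first colon into key and value."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for table in soup.find_all("table", id="productDetailsTable"):
        for li in table.find_all("li"):
            text = li.get_text(" ", strip=True)
            key, sep, value = text.partition(":")
            if sep:  # ignore items that contain no colon
                details[key.strip()] = value.strip()
    return details


if __name__ == "__main__":
    req = build_request("http://www.amazon.com/dp/0439136369")
    print(req.get_method(), req.get_header("User-agent"))
    print(details_from_text(SAMPLE_HTML))
```

To fetch a live page, the request object would be passed to urllib.request.urlopen and the response body to details_from_text; whether the site serves the same markup to a script is not something this sketch can guarantee.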

Source

2014-10-31 21:22:38 Hackaholic
