2017-11-25 133 views
2

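I want to scrape some data from a website. The HTML is shown below. I want to extract the text "No description for 632930413867". Is BeautifulSoup not working for this kind of web scraping?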

HTML code:

<div class="col-xs-6 col-sm-6 col-md-6 col-lg-6"> 
    <table class="table product_info_table"> 
    <tbody> 
     <tr> 
     <td>GS1 Address</td> 
     <td>R.R. 1, Box 2, Malmo, NE 68040</td> 
     </tr> 
     <tr> 
     <td>Description</td> 
     <td> 
      <div id="read_desc"> 
      No description for 632930413867 
      </div> 
     </td> 
     </tr> 
    </tbody> 
    </table> 
</div> 

I also want the image src from the same page:

<div class="centered_image header_image"> 
<img src="https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg" title="UPC 632930413867" alt="UPC 632930413867"> 

So I used this code:

Baseurl = "https://www.buycott.com/upc/632930413867" 
uClient = '' 
while uClient == '': 
    try: 
     uClient = requests.get(Baseurl) 
     print("Relax we are getting the data...") 

    except: 
     print("Connection refused by the server..") 
     print("Let me sleep for 7 seconds") 
     time.sleep(7) 
     print("Was a nice sleep, now let me continue...") 
     continue 


page_html = uClient.content 

uClient.close() 
page_soup = soup(page_html, "html.parser") 

Productcontainer = page_soup.find_all("div", {"class": "row"}) 
link = page_soup.find(itemprop="image") 

print(Productcontainer) 

for item in Productcontainer: 
    print(link) 
    productdescription = Productcontainer.find("div", {"class": "product_info_table"}) 
    print(productdescription) 

When I run this code, no data is displayed. How can I get the description and the img src?

Answers

3

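There is only one instance of each item (the image and the product description) on the page, so you can go to them directly with find(); there is no need to use find_all() in this case: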

import time

import requests
from bs4 import BeautifulSoup as soup

Baseurl = "https://www.buycott.com/upc/632930413867"
uClient = ''
while uClient == '':
    try:
        uClient = requests.get(Baseurl)
        print("Relax we are getting the data...")
    except requests.exceptions.ConnectionError:
        # Retry only on connection failures; other errors should surface.
        print("Connection refused by the server..")
        print("Let me sleep for 7 seconds")
        time.sleep(7)
        print("Was a nice sleep, now let me continue...")
        continue

page_html = uClient.content
uClient.close()

page_soup = soup(page_html, "html.parser")
# Each element occurs once on the page, so find() is enough.
productdescription = page_soup.find("div", {"id": "read_desc"}).text
link = page_soup.find("div", {"class": "centered_image header_image"}).find("img")['src']
print(productdescription)
print(link)

Output:

Relax we are getting the data... 

No description for 632930413867 

https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg 
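
As a minimal offline sketch of the same idea, the snippet below parses the HTML fragments from the question, inlined as a string so no network call is made. The name `HTML` and the trimmed markup are assumptions for illustration only. It also suggests why the question's code showed nothing useful: `product_info_table` is a class on the `<table>`, not on a `<div>`, and `find()` has to be called on a single tag rather than on the list that `find_all()` returns.

```python
from bs4 import BeautifulSoup

# HTML copied from the question, trimmed to the relevant parts (illustrative).
HTML = """
<div class="centered_image header_image">
  <img src="https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg"
       title="UPC 632930413867" alt="UPC 632930413867">
</div>
<table class="table product_info_table">
  <tbody>
    <tr>
      <td>Description</td>
      <td>
        <div id="read_desc">
          No description for 632930413867
        </div>
      </td>
    </tr>
  </tbody>
</table>
"""

page_soup = BeautifulSoup(HTML, "html.parser")

# The class sits on a <table>, so searching for a <div> with it finds nothing.
wrong = page_soup.find("div", {"class": "product_info_table"})
table = page_soup.find("table", {"class": "product_info_table"})

# find_all() returns a list-like ResultSet; call find() on its items, not on the set.
rows = page_soup.find_all("tr")

# get_text(strip=True) drops the surrounding whitespace that .text keeps.
description = page_soup.find("div", {"id": "read_desc"}).get_text(strip=True)
link = page_soup.find("div", {"class": "centered_image"}).find("img")["src"]

print(wrong)
print(description)
print(link)
```

Using `get_text(strip=True)` instead of `.text` would also remove the blank lines that surround the description in the output above.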
2

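You just need to inspect the HTML and identify the tags that hold the data you want to scrape.
In this case, the image is at div.centered_image.header_image img, and the description is at div#read_desc.
An example using bs4 CSS selectors: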

import requests 
from bs4 import BeautifulSoup 

baseurl = "https://www.buycott.com/upc/632930413867" 
page_html = requests.get(baseurl).content 
soup = BeautifulSoup(page_html, "html.parser") 
image = soup.select_one('div.centered_image.header_image img')['src'] 
description = soup.select_one('div#read_desc').text.strip() 

print(image) 
print(description) 

https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg
No description for 632930413867
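
One caveat with this approach: `select_one()` returns `None` when nothing matches, so indexing `['src']` or reading `.text` on the result raises an error if the page layout changes or the request is blocked. A hedged sketch of a guard follows; the function name `extract_product`, the `sample` markup, and the `example.com` URL are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup


def extract_product(html):
    """Return the image URL and description, with None for anything missing."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.select_one("div.centered_image.header_image img")
    desc = soup.select_one("div#read_desc")
    return {
        # Check for None before touching attributes or text.
        "image": img.get("src") if img is not None else None,
        "description": desc.get_text(strip=True) if desc is not None else None,
    }


# Hypothetical sample markup shaped like the page in the question.
sample = (
    '<div class="centered_image header_image">'
    '<img src="https://example.com/a.jpg"></div>'
    '<div id="read_desc"> No description for 632930413867 </div>'
)

found = extract_product(sample)
missing = extract_product("<p>blocked</p>")
print(found)
print(missing)
```

Returning `None` for missing fields lets the caller decide whether to retry, skip the product, or report an error.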

0

This can also be done like this:

import requests 
from bs4 import BeautifulSoup 

soup = BeautifulSoup(requests.get("https://www.buycott.com/upc/632930413867").text, "lxml") 
desc = soup.select("#read_desc")[0].text.strip() 
link = soup.select(".centered_image img")[0]['src'].strip() 
print("{}\n{}".format(desc,link)) 

Output:

No description for 632930413867 
https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg 
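
Note that the "lxml" parser is a separate package and has to be installed alongside bs4. If it may not be available, one possible approach (a sketch, not part of the answer above) is to fall back to the built-in "html.parser". The helper name `make_soup` and the `sample` string are assumptions for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, else fall back to html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(html, "html.parser")


# Hypothetical sample shaped like the description element in the question.
sample = '<div id="read_desc"> No description for 632930413867 </div>'
desc = make_soup(sample).select("#read_desc")[0].text.strip()
print(desc)
```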