BeautifulSoup: find_all returns incomplete children

I am trying to scrape the nested divs under the class "product-tech-section-row" from the following HTML:

<h2 class="product-tech-section-title"> 
    Présentation de la TV SAMSUNG UE49MU9005</h2> 

<div class="product-tech-section-row"> 
    <div> 
     Désignation</b> : 
    </div> 
    <div> 
     <b>SAMSUNG UE49MU9005</b> (UE 49MU9005 TXXC)<br><br>Plus d'informations sur les <a    href="http://www.lcd-compare.com/info-tv-led-samsung.htm" title="TV Samsung : informations et statistiques">TV LED Samsung</a><br><a href="http://www.lcd-compare.com/tv-liste-122.htm?tv_label=7,8" title="Liste des TV 4K">Voir les TV 4K (Ultra HD ou Quad HD)</a></div> 
</div> 


<div class="product-tech-section-row"> 
    <div> 
     Date de sortie (approx.)</b> : 
    </div> 
    <div> 
     Mars 2017</div> 
</div>

However, find_all() only extracts the first child div (only Désignation; SAMSUNG UE... does not appear), as shown in my code below. Am I missing something? Any help would be appreciated.

from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

# Allow access to the website (personal use)
prod_url = "http://www.lcd-compare.com/televiseur-SAMUE49MU9005-SAMSUNG-UE49MU9005.htm"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(prod_url, headers=hdr)
prod_html = uReq(req)

# Build the soup from the response (the parser choice here is an assumption)
prod_soup = soup(prod_html.read(), "html.parser")

# Parse the technical details; attrs is a dict mapping attribute name to value
tec_list = prod_soup.find_all("div", {"class": "product-tech-section-row"})

--------------------------------------------------------------------------------------- 
# However, this is what I am getting:
>>> print(tec_list[0])
<div class="product-tech-section-row">
<div>
Désignation</div></div>

>>> print(tec_list[0].findChildren())
[<div>
Désignation</div>]

2017-11-04 p404

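Try print(tec_list[1]); that should give you the "SAMSUNG UE49MU9005" result. Keep in mind that find_all() returns a list of the matching elements, which is what tec_list holds. – Ali

Thanks for the reply. Unfortunately, print(tec_list[1]) only returns the "product-tech-section-row" class element. – p404

Hi p404, please check my answer below. – Ali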
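To illustrate the point in the comments that find_all() returns a list, here is a minimal sketch that walks every matching row and reads its two nested divs. It assumes a cleaned-up, well-formed copy of the snippet from the question (SAMPLE_HTML and extract_rows are hypothetical names, not part of the original post), so it shows the intended usage rather than reproducing the problem.

```python
from bs4 import BeautifulSoup

# Assumption: a simplified, well-formed version of the markup in the question.
SAMPLE_HTML = """
<div class="product-tech-section-row">
    <div>Désignation :</div>
    <div><b>SAMSUNG UE49MU9005</b> (UE 49MU9005 TXXC)</div>
</div>
<div class="product-tech-section-row">
    <div>Date de sortie (approx.) :</div>
    <div>Mars 2017</div>
</div>
"""


def extract_rows(html):
    """Return (label, value) pairs, one per product-tech-section-row div."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # find_all() returns a list-like ResultSet; each item is one row element.
    for row in soup.find_all("div", class_="product-tech-section-row"):
        # recursive=False keeps only the direct child divs of the row.
        cells = row.find_all("div", recursive=False)
        if len(cells) >= 2:
            label = cells[0].get_text(" ", strip=True).rstrip(" :")
            value = cells[1].get_text(" ", strip=True)
            pairs.append((label, value))
    return pairs


for label, value in extract_rows(SAMPLE_HTML):
    print(label, "->", value)
```

If the second cell is missing from a row, as in the question, the len(cells) check skips that row, which makes the problem visible as a shorter result list instead of an IndexError.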

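I believe the reason you cannot scrape the nested elements is that the website you are accessing relies heavily on JavaScript.

I used Selenium to check whether this is the case, and I was able to parse the nested elements without any problems.

Code: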

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# Run Chrome headless so that no browser window opens
chromeOptions = Options()
chromeOptions.add_argument("--headless")
driver = webdriver.Chrome(options=chromeOptions)
url = 'http://www.lcd-compare.com/televiseur-SAMUE49MU9005-SAMSUNG-UE49MU9005.htm'
driver.get(url)
# page_source is the page as the browser currently holds it
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
tec_list = soup.find_all("div", {"class": "product-tech-section-row"})

print(tec_list[0])

Output:

<div class="product-tech-section-row"> 
<div> 
Désignation : 
</div> 
<div> 
<b>SAMSUNG UE49MU9005</b> (UE 49MU9005 TXXC)<br/><br/>Plus d'informations sur les <a data-hasqtip="139" href="http://www.lcd-compare.com/info-tv-led-samsung.htm" oldtitle="TV Samsung : informations et statistiques" title="">TV LED Samsung</a><br/><a data-hasqtip="141" href="http://www.lcd-compare.com/tv-liste-122.htm?tv_label=7,8" oldtitle="Liste des TV 4K" title="">Voir les TV 4K (Ultra HD ou Quad HD)</a></div> 
</div>
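
The JavaScript explanation can also be checked without a browser: if the missing text is already present in the raw server response, the content is not injected by scripts, and the cause may lie elsewhere (for example in how the markup is parsed). The following is only a rough sketch under that assumption; fetch_static_html, appears_in_static_html and check_page are hypothetical helper names, and the search string is taken from the output above.

```python
from urllib.request import Request, urlopen


def fetch_static_html(url, user_agent="Mozilla/5.0"):
    """Download the page as the server sends it; no JavaScript is executed."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def appears_in_static_html(html, needle):
    """Case-insensitive check for a text fragment in the raw HTML string."""
    return needle.casefold() in html.casefold()


def check_page(url, needle):
    """Fetch the page and report whether the fragment is in the static HTML."""
    return appears_in_static_html(fetch_static_html(url), needle)
```

Calling check_page(url, "UE 49MU9005 TXXC") with the URL from the question returns True if the text is in the static HTML and False if it only appears after scripts have run. A plain substring test is crude (it ignores entities and whitespace differences), so a negative result is a hint rather than proof.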

2017-11-04 03:23:53 Ali

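Thanks Ali! Your suggestion works well. By the way, I wanted to ask whether there are other libraries that can do the same job without involving a browser. That would make it easy to add this kind of code to a web API. – p404

@p404, sorry for the late reply. I don't really know of any other library that would achieve your goal, but keep searching. – Ali

Hi Ali, I did some research and found that the PhantomJS headless browser can do it. It can also be loaded from the Selenium webdriver, like this: driver = webdriver.PhantomJS(). I hope you find it useful in the future too. – p404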
