How to get a div with multiple classes in BS4

What is the most efficient way to get divs with BeautifulSoup4 when they have multiple classes?

I have an HTML structure like this:

<div class='class1 class2 class3 class4'> 
    <div class='class5 class6 class7'> 
    <div class='comment class14 class15'> 
     <div class='date class20 showdate'> 1/10/2017</div> 
     <p>comment2</p> 
    </div> 
    <div class='comment class25 class9'> 
     <div class='date class20 showdate'> 7/10/2017</div> 
     <p>comment1</p> 
    </div> 
    </div> 
</div>

I want to get the divs with the comment class. Nested classes are usually not a problem, but I don't know why this command:

html = BeautifulSoup(content, "html.parser") 
comments = html.find_all("div", {"class":"comment"})

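doesn't work. It returns an empty list. I guess this happens because there are many classes, so it only looks for divs whose class is exactly comment, and there are none. How can I find all the comments?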

Source

2017-10-08 Elsa Strahmbrand

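Which version of BeautifulSoup are you using? The code above, as well as soup.findAll('div', {'class': 'comment'}), works for me. It retrieves both div tags. Also, try using the re module and changing the argument like this: soup.findAll('div', {'class': re.compile(r'comment')}). That should definitely work. – Mahesh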
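To illustrate the point in the comment above, here is a minimal sketch run against the HTML from the question. It assumes bs4 is installed and uses the built-in html.parser; the function name find_comment_divs and the SAMPLE_HTML constant are made up for this example. BeautifulSoup treats class as a multi-valued attribute, so searching for a single class name matches any tag that has that class among others.

```python
import re

from bs4 import BeautifulSoup

# The HTML from the question, unchanged apart from being stored in a string.
SAMPLE_HTML = """
<div class='class1 class2 class3 class4'>
    <div class='class5 class6 class7'>
    <div class='comment class14 class15'>
     <div class='date class20 showdate'> 1/10/2017</div>
     <p>comment2</p>
    </div>
    <div class='comment class25 class9'>
     <div class='date class20 showdate'> 7/10/2017</div>
     <p>comment1</p>
    </div>
    </div>
</div>
"""


def find_comment_divs(markup):
    """Return the comment divs found by four equivalent lookups (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    by_dict = soup.find_all("div", {"class": "comment"})      # form used in the question
    by_kwarg = soup.find_all("div", class_="comment")         # keyword form
    by_css = soup.select("div.comment")                       # CSS selector form
    by_regex = soup.find_all("div", {"class": re.compile(r"comment")})  # regex form
    return by_dict, by_kwarg, by_css, by_regex


if __name__ == "__main__":
    for result in find_comment_divs(SAMPLE_HTML):
        print(len(result), [div.p.text for div in result])
```

Note that the regex form also matches any class that merely contains "comment" (for example commentCite), so the plain string form is the stricter choice when only the exact class is wanted.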

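I'm using BeautifulSoup4 (bs4). I tried the same code on another site and it worked fine! So the problem isn't the multiple classes. Could it be some kind of protection against scraping?

Could you share the site/URL you're having trouble with? – Mahesh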

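Apparently, the URL that serves the comments section is different from the original URL that retrieves the main content.

This is the original URL you gave: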

http://community.sparknotes.com/2017/10/06/find-out-your-colleges-secret-mantra-we-hack-college-life-at-the-100-of-the-best

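Behind the scenes, if you record the network log in the Network tab of Chrome's developer tools, you'll see a list of all the URLs requested by the browser. Most of them fetch images and scripts. A few go to other sites such as Facebook or Google (for analytics, etc.). The browser also sends another request to this particular site (sparknotes), and that request returns the comments section. This is the URL:

http://community.sparknotes.com/commentlist?post_id=1375724&page=1&comment_type=&_=1507467541548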

The post_id value can be found in the page returned when we request the first URL. It is contained in an input tag whose type is hidden.

<input type="hidden" id="postid" name="postid" value="1375724">

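You can extract it from the first page with a simple soup.find('input', {'id': 'postid'})['value']. Since this value uniquely identifies the post, you don't need to worry about it changing dynamically between requests.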
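A slightly more defensive version of that lookup might look like the sketch below. It works on a string rather than a live response, uses html.parser so it has no lxml dependency, and returns None instead of raising when the input tag is missing; the helper name extract_post_id is an assumption for illustration only.

```python
from bs4 import BeautifulSoup


def extract_post_id(markup):
    """Return the value of the hidden input with id 'postid', or None if absent."""
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.find("input", {"id": "postid"})
    if tag is None:
        # The page layout changed or the request was blocked: no hidden input found.
        return None
    return tag.get("value")


if __name__ == "__main__":
    page = '<input type="hidden" id="postid" name="postid" value="1375724">'
    print(extract_post_id(page))
```

Checking for None first avoids the TypeError that soup.find(...)['value'] would raise if the site returned a page without the hidden input.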

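I couldn't find the value '1507467541548' that is passed to the '_' parameter (the last parameter of the URL) anywhere in the main page or in the cookies set by the response headers of any page.

However, I went out on a limb and tried fetching the URL without the '_' parameter, and it worked.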
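Dropping the '_' parameter can also be done programmatically rather than by editing the string by hand. The sketch below uses only the standard library (urllib.parse); the function name drop_query_param is hypothetical, and the URL is the one captured in the Network tab.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as 'comment_type='.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    original = ('http://community.sparknotes.com/commentlist'
                '?post_id=1375724&page=1&comment_type=&_=1507467541548')
    print(drop_query_param(original, '_'))
```

Whether the server keeps accepting requests without '_' is something only the site controls; this only automates the edit that worked at the time.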

So, here is the whole script that worked for me:

from bs4 import BeautifulSoup 
import requests 

req_headers = { 
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 
    'Accept-Encoding': 'gzip, deflate', 
    'Accept-Language': 'en-US,en;q=0.8', 
    'Connection': 'keep-alive', 
    'Host': 'community.sparknotes.com', 
    'Upgrade-Insecure-Requests': '1', 
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' 
} 

with requests.Session() as s: 
    url = 'http://community.sparknotes.com/2017/10/06/find-out-your-colleges-secret-mantra-we-hack-college-life-at-the-100-of-the-best' 
    r = s.get(url, headers=req_headers) 

    soup = BeautifulSoup(r.content, 'lxml') 
    post_id = soup.find('input', {'id': 'postid'})['value'] 

    # url = 'http://community.sparknotes.com/commentlist?post_id=1375724&page=1&comment_type=&_=1507467541548' # the original URL found in network tab 
    url = 'http://community.sparknotes.com/commentlist?post_id={}&page=1&comment_type='.format(post_id) # modified by removing the '_' parameter 

    r = s.get(url) 

    soup = BeautifulSoup(r.content, 'lxml') 
    comments = soup.findAll('div', {'class': 'commentCite'}) 

    for comment in comments: 
        c_name = comment.div.a.text.strip() 
        c_date_text = comment.find('div', {'class': 'commentBodyInner'}).text.strip() 
        print(c_name, c_date_text)

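As you can see, I didn't pass the headers in the second request, so I'm not sure whether they are needed at all. You could try omitting them in the first request as well. But make sure you use requests, since I haven't tried this with urllib. Cookies may play an important role here.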

Source

2017-10-08 13:40:23 Mahesh
