Python BeautifulSoup: selecting only the top tag

I've run into a problem that is probably easy, but I couldn't find it in the documentation.

Here is the target HTML structure, which is very simple.

<h3>Top 
    <em>Mid</em> 
    <span>Down</span> 
</h3>

I want only the text "Top" inside the h3 tag, so I wrote this:

from bs4 import BeautifulSoup
html = "<h3>Top <em>Mid </em><span>Down</span></h3>"
soup = BeautifulSoup(html, "html.parser")
print(soup.select("h3")[0].text)

But it returns Top Mid Down. How can I change it?

Source

2016-07-25 Coda Chang

To get only the text directly inside a tag, you can use find with text=True and recursive=False:

In [2]: from bs4 import BeautifulSoup 
    ...: html ="<h3>Top <em>Mid </em><span>Down</span></h3>" 
    ...: soup = BeautifulSoup(html,"html.parser") 
    ...: print(soup.find("h3").find(text=True,recursive=False)) 
    ...: 
Top
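As a minimal sketch of the same idea, the snippet below collects every text node that is a direct child of the tag, not just the first one. The helper name `direct_text` and the extra " tail" text in the sample markup are assumptions added for illustration. It uses `string=True`, which is the newer name for the `text` argument in Beautiful Soup 4.4.0 and later.

```python
from bs4 import BeautifulSoup

# Sample markup with an extra trailing text node (" tail") added for illustration.
html = "<h3>Top <em>Mid </em><span>Down</span> tail</h3>"
soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

def direct_text(tag):
    """Join the text nodes that are direct children of `tag`, skipping nested tags."""
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(p.strip() for p in pieces if p.strip())

# find() stops at the first direct text node; the helper gathers all of them.
first = h3.find(string=True, recursive=False)
print(first.strip())    # Top
print(direct_text(h3))  # Top tail
```

With `recursive=False` the search only looks at the tag's immediate children, which is why the text inside `<em>` and `<span>` is left out.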

Depending on the format, there are several other ways:

print(soup.find("h3").contents[0]) 
print(next(soup.find("h3").children)) 
print(soup.find("h3").next_element)
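These shortcuts all return the first child node, whatever it is. The sketch below applies them to the multi-line markup from the question, where the first text node carries trailing whitespace, and to a tag whose first child is another tag. The helper `leading_text` is a hypothetical name introduced here, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

multiline = """<h3>Top
    <em>Mid</em>
    <span>Down</span>
</h3>"""
h3 = BeautifulSoup(multiline, "html.parser").find("h3")

# The three shortcuts refer to the same node: the first child of <h3>.
first_child = h3.contents[0]
same_node = first_child is next(h3.children) is h3.next_element

def leading_text(tag):
    """Return the stripped text before the first nested tag, or '' if the tag starts with another tag."""
    first = tag.contents[0] if tag.contents else None
    return first.strip() if isinstance(first, NavigableString) else ""

tag_first = BeautifulSoup("<h3><em>Mid</em> Top</h3>", "html.parser").find("h3")

print(same_node)                      # True
print(leading_text(h3))               # Top
print(repr(leading_text(tag_first)))  # ''
```

Checking the node type first avoids returning a whole `<em>` element when the heading happens to start with markup.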

Source

2016-07-25 10:48:45

Thanks, I will look into 'contents' and 'children' in more detail.

Try this:

from bs4 import BeautifulSoup
html = "<h3>Top <em>Mid </em><span>Down</span></h3>"
soup = BeautifulSoup(html, "html.parser")
# select() returns a list, so take the first match, then its first child node
print(soup.select("h3")[0].contents[0])

I'm not entirely sure, though. Check this - How to find children of nodes using Beautiful Soup

Basically, you need to hunt for the first childNode.
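One caveat when hunting for the first child node: tag-only searches skip text nodes, so they will not find "Top". A small sketch, with variable names chosen here for illustration:

```python
from bs4 import BeautifulSoup

html = "<h3>Top <em>Mid </em><span>Down</span></h3>"
# select() returns a list of matches, so index it to get the <h3> tag itself.
h3 = BeautifulSoup(html, "html.parser").select("h3")[0]

tag_children = h3.find_all(recursive=False)  # direct child tags only
all_children = h3.contents                   # text nodes and tags

print([child.name for child in tag_children])  # ['em', 'span']
print(len(all_children))                       # 3
print(all_children[0].strip())                 # Top
```

Here `.contents` includes the leading text node, while `find_all` without a `string` filter yields only `<em>` and `<span>`.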

Source

2016-07-25 10:21:46 kawadhiya21

There was a syntax error in your code, but thanks for the information.


It is easy to use a regular expression, like this:

import re
pageid = re.search('<h3>(.*?)</h3>', curPage, re.DOTALL)

Run the search, then use the pageid.group(value) method to read the match.
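A rough sketch of how this plays out on the question's markup, assuming a bare `<h3>` tag with no attributes. The name `cur_page` and the second pattern are additions for illustration: the original pattern captures the nested markup too, so a narrower pattern is needed to stop at the first nested tag.

```python
import re

cur_page = "<h3>Top <em>Mid </em><span>Down</span></h3>"

# Capture everything between the h3 tags, nested markup included.
inner = re.search(r"<h3>(.*?)</h3>", cur_page, re.DOTALL)
# Capture only the text up to the first nested tag.
leading = re.search(r"<h3>([^<]*)", cur_page)

print(inner.group(1))            # Top <em>Mid </em><span>Down</span>
print(leading.group(1).strip())  # Top
```

Patterns like these are brittle against attributes, comments, or differently cased tags, which is one reason a parser is usually preferred for HTML.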

Source

2016-07-25 10:34:22

Thanks, but I think getting the content with BeautifulSoup would be easier.
