BeautifulSoup tags - find_all succeeds but find fails

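I have a problem where soup.find_all for the h2 tag succeeds, but soup.find with a specified text fails.

I need to find the h2 tags with various texts such as Introduction, Results, and so on, as in the output below.

Can anyone advise? Thanks.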

print(soup.find_all('h2')) 
[<h2 class="Heading">Abstract</h2>, 
<h2 class="Heading" data-role="collapsible-handle" tabindex="-1">Introduction<span class="section-icon"></span></h2>, 
<h2 class="Heading" data-role="collapsible-handle" tabindex="-1">Patients and methods<span class="section-icon"></span></h2>, 
<h2 class="Heading" data-role="collapsible-handle" tabindex="-1">Results<span class="section-icon"></span></h2>, 
<h2 class="Heading" data-role="collapsible-handle" tabindex="-1">Discussion<span class="section-icon"></span></h2>, 
<h2 class="Heading" data-role="collapsible-handle" id="copyrightInformation" tabindex="-1">Copyright information<span class="section-icon"></span></h2>, 
<h2 class="Heading" data-role="collapsible-handle" id="aboutarticle" tabindex="-1">About this article<span class="section-icon"></span></h2>, 
<h2 class="u-isVisuallyHidden">Article actions</h2>, <h2 class="u-h4 u-jsIsVisuallyHidden">Article contents</h2>, 
<h2 class="u-isVisuallyHidden">Cookies</h2>] 

print(soup.find('h2', text='Introduction')) 
None

Source

2017-02-13 Michael Lam

Please show the HTML document you used as input. Thanks! More help: http://stackoverflow.com/help/mcve.

Try this:

soup.find(lambda el: el.name == "h2" and "Introduction" in el.text)
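
As a rough illustration of how the function filter behaves, here is a minimal sketch. It assumes bs4 is installed and uses the built-in "html.parser"; the two-heading HTML snippet is made up for the example and is not the asker's page. It contrasts the substring test above with an exact comparison on the tag's combined text.

```python
from bs4 import BeautifulSoup

# Hypothetical input: one heading that merely contains the word,
# and one whose visible text is exactly "Introduction".
html = """
<h2 class="Heading">Introduction to the document</h2>
<h2 class="Heading" tabindex="-1">Introduction<span class="section-icon"></span></h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Substring test: returns the first h2 whose combined text contains the word.
loose = soup.find(lambda el: el.name == "h2" and "Introduction" in el.text)

# Exact test: get_text(strip=True) joins all descendant strings and trims whitespace.
exact = soup.find(
    lambda el: el.name == "h2" and el.get_text(strip=True) == "Introduction"
)

print(loose.get_text())  # Introduction to the document
print(exact.get_text())  # Introduction
```

The exact comparison skips the first heading because its combined text is longer than the target string.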

Source

2017-02-13 09:48:51

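Thanks. However, I also need the h2 tag whose text is exactly "Introduction", without matching other strings such as "Introduction to the document". Is there another way besides adding a regular expression to strip the span tag?

soup.find(lambda el: el.name == "h2" and el.text == "Introduction")

Thanks. That is a good solution. How does this approach solve the problem when text="Introduction" in the question did not? It seems to me that both look for text equal to "Introduction".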


text='Introduction' searches for navigable strings, not tags.

From the documentation:

text is an argument that lets you search for NavigableString objects instead of Tags

You should try:

print(soup.find(text='Introduction').parent)
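
A minimal sketch of this string-first approach, assuming bs4 4.4 or later (where the argument is called string; earlier versions call it text) and the built-in "html.parser". The HTML snippet is invented for illustration. Because a matching text node can sit inside any tag, the sketch also checks the parent's name before accepting it.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the same text appears directly in an h2 and inside a span.
html = """
<h2 class="Heading">Introduction<span class="section-icon"></span></h2>
<p>See the <span>Introduction</span> for details.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Find every text node that is exactly "Introduction", whatever tag encloses it.
nodes = soup.find_all(string="Introduction")
parents = [node.parent.name for node in nodes]
print(parents)  # ['h2', 'span']

# Keep only the node whose direct parent is an h2.
h2 = next(node.parent for node in nodes if node.parent.name == "h2")
print(h2["class"])  # ['Heading']
```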

Source

2017-02-13 09:51:05

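The "Introduction" text could be inside any tag (for example, inside a 'span').

The OP is not trying to search for a tag with text='Introduction'; you must have misunderstood the question.

You are right about NavigableString, but in this case the result is 'None' because the 'h2' tag contains another tag.

When we use text/string as a filter, what happens under the hood is that tag.string is used to get the text, which is then compared with the filter. In this case: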

import bs4 

html = '''<h2 class="Heading" data-role="collapsible-handle" tabindex="-1">Introduction<span class="section-icon"></span></h2>''' 
soup = bs4.BeautifulSoup(html,'lxml') 
print(soup.h2.string)

Output:

None

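Why .string returns None:

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:

The h2 tag contains a span tag with empty text, so it has two children, .string is ambiguous, and it returns None.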
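
The behaviour described above can be reproduced with a small sketch. It assumes bs4 4.4 or later with the built-in "html.parser"; the ids "a" and "b" are made up so the two headings are easy to tell apart.

```python
from bs4 import BeautifulSoup

# Hypothetical input: one h2 with a single text child, one with text plus a span.
html = """
<h2 id="a">Abstract</h2>
<h2 id="b">Introduction<span class="section-icon"></span></h2>
"""
soup = BeautifulSoup(html, "html.parser")
a = soup.find(id="a")
b = soup.find(id="b")

print(a.string)         # Abstract  (exactly one child, a text node)
print(len(b.contents))  # 2         (a text node and the empty span)
print(b.string)         # None      (more than one child, so .string is None)

# The string filter is compared against tag.string, so only the first h2 can match.
found_abstract = soup.find("h2", string="Abstract")
found_intro = soup.find("h2", string="Introduction")
print(found_abstract["id"])  # a
print(found_intro)           # None
```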

@Thomas Lehoux's answer is the right approach.

This is the BS3 API:

findNextSiblings(name, attrs, text, limit, **kwargs)

This is the BS4 API:

find_next_siblings(name, attrs, string, limit, **kwargs)

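You will notice that the old API uses text and the current one uses string, but they are the same: both use tag.string to get the value, and you can use either. BS4 simply still accepts the old name, that is all.

I cannot find any tag.text API in the documentation for either version, but it behaves like tag.get_text(): it concatenates all the text under the tag.

In your case: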

soup.h2.string >>> None 
soup.h2.text  >>> Introduction 
soup.h2.get_text()>>> Introduction

In short:

text in filter is tag.string 
text in tag itself is tag.text

I recommend using find(string=' ') in practice; this reduces confusion.
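
To make the difference between the filter's view (tag.string) and the tag's own text (tag.text) concrete, here is a minimal sketch with a heading that has non-empty nested text. It assumes bs4 4.4 or later and the built-in "html.parser"; the heading content is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical heading with three children: text, an <em> tag, and more text.
soup = BeautifulSoup("<h2>Results <em>and</em> notes</h2>", "html.parser")
h2 = soup.h2

print(h2.string)         # None
print(h2.text)           # Results and notes
print(list(h2.strings))  # ['Results ', 'and', ' notes']

# The string filter looks at h2.string (None here), so it finds nothing,
# even though the combined text matches.
found = soup.find("h2", string="Results and notes")
print(found)             # None
```

When the match should be made on the combined text, a function filter that compares el.get_text() is one option, as in the lambda suggested earlier in the thread.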

Source

2017-02-13 13:24:28

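Thanks. I did not realise it would be confused by the span tag. Yes, I think @Thomas Lehoux's approach is close. But I want the h2 tag whose string is exactly "Introduction". Is there another way besides adding a regular expression to strip the span tag?

@Michael Lam soup.find(lambda el: el.name == "h2" and el.text == "Introduction").text works: because the span tag's text is empty, .text returns exactly the h2 tag's own string.

Why does passing a lambda that tests text == "Introduction" work differently from using .find("h2", text="Introduction")?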
