Beautiful Soup - grabbing the string right after the first specified tag

I'm trying to grab the string that comes immediately after the opening <td> tag. The following code works:

webpage = urlopen(i).read()
soup = BeautifulSoup(webpage)
for elem in soup('td', text=re.compile(r".\.doc")):
    print elem.parent

when the HTML looks like this:

<td>plan_49913.doc</td>

but not when the HTML looks like this:

<td>plan_49913.doc<br /> <font color="#990000">Document superseded by:  </font><a href="/plans/Jan_2012.html">January 2012</a></td>

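I tried playing around with attrs but couldn't get it to work. Basically, I just want to grab 'plan_49913.doc' in either HTML case.

Any advice would be greatly appreciated.

Thanks in advance.

~chrisk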

2012-01-06 user1117603

Just use the next attribute. It holds the node that follows the tag, and in this case that node is a text node.

>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>' 
>>> bs = BeautifulSoup(html) 
>>> texts = [ node.next for node in bs.findAll('td') if node.next.endswith('.doc') ] 
>>> texts 
[u'plan_49913.doc']

If you prefer, you can change the if clause to use a regular expression.
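A minimal sketch of that regular-expression variant. It assumes Beautiful Soup 4 on Python 3 (the bs4 package) rather than the version used in the session above, so findAll becomes find_all and the next node is read through next_element. The pattern and variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup, NavigableString

html = ('<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: '
        '&#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>')
soup = BeautifulSoup(html, "html.parser")

# Assumed pattern: the text node must end in ".doc".
doc_name = re.compile(r"\.doc$")

texts = [
    td.next_element
    for td in soup.find_all("td")
    # Guard against cells whose first node is a tag (or that are empty).
    if isinstance(td.next_element, NavigableString)
    and doc_name.search(td.next_element)
]
print(texts)
```

The isinstance check is an addition: it keeps the comprehension from failing on a cell that starts with a tag instead of text.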

2012-01-06 16:39:59

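Thank you. That works... unfortunately, I'm having a hard time understanding the syntax of the "texts =" line. Would you be kind enough to break it down and structure it like this: 'for ... print' – user1117603 2012-01-07 07:25:04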

It's a list comprehension. It is equivalent to: 'texts = []' 'for node in bs.findAll('td'):' 'if node.next.endswith('.doc'):' 'texts.append(node.next)' – 2012-01-07 08:40:25
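Laid out as the 'for ... print' loop asked for in the comment, the comprehension might look like the sketch below. It assumes Beautiful Soup 4 on Python 3, so find_all and next_element stand in for findAll and next. The name first and the isinstance guard are not part of the original answer.

```python
from bs4 import BeautifulSoup

html = ('<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: '
        '&#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>')
soup = BeautifulSoup(html, "html.parser")

texts = []
for node in soup.find_all("td"):
    # The node right after the opening <td> tag.
    first = node.next_element
    # Text nodes are str subclasses; tags are not, so they are skipped here.
    if isinstance(first, str) and first.endswith(".doc"):
        texts.append(first)
        print(first)
```

Each pass through the loop does what one step of the comprehension does: take a cell, test its first node, and keep it if the test passes.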

This works for me:

>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>' 
>>> soup = BeautifulSoup(html) 
>>> soup.find(text=re.compile(r'.\.doc'))
u'plan_49913.doc'

Am I missing something?

Also, note that according to the documentation:

If you use text, then any values you give for name and the keyword arguments are ignored.

So you don't need to pass 'td', because it is ignored anyway. In other words, matching text under any other tag will be returned as well.
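If only the matches inside <td> cells are wanted, one option is to search by text alone and then filter on each match's parent. The sketch below assumes Beautiful Soup 4 on Python 3, where the text filter is passed as string=. The extra <p> element is made-up sample markup that shows a match outside a table cell.

```python
import re
from bs4 import BeautifulSoup

html = ('<p>notes.doc</p>'  # hypothetical extra element, not from the question
        '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: '
        '&#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>')
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile(r".\.doc")

# Searching by text alone returns matching text nodes under any tag.
all_matches = soup.find_all(string=pattern)

# Keep only the text nodes whose direct parent is a <td>.
in_cells = [s for s in all_matches if s.parent.name == "td"]
print(all_matches)
print(in_cells)
```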

2012-01-06 11:16:39 jcollado
