Beautiful Soup: get the contents of a child node

I have the following Python code:

import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

def scrapeSite(urlToCheck):
    html = urllib2.urlopen(urlToCheck).read()
    soup = BeautifulSoup(html)
    # Select every <td class="c"> cell
    tdtags = soup.findAll('td', {"class": "c"})
    for t in tdtags:
        print t.encode('latin1')

This gives me the following HTML:

<td class="c"> 
<a href="more.asp">FOO</a> 
</td> 
<td class="c"> 
<a href="alotmore.asp">BAR</a> 
</td>

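I want to get the text inside the a node (e.g. FOO or BAR), which I expected to be something like t.contents.contents. Unfortunately, it's not that easy :) Does anyone have an idea how to solve this?

Thanks a lot, any help is appreciated!

Cheers, Joseph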


2010-10-21 Joseph jun. Melettukunnel

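In this case, you can use t.contents[1].contents[0] to get FOO and BAR.

The thing is, contents returns a list of all child elements (Tags and NavigableStrings). If you print contents, you will see something like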

[u'\n', <a href="more.asp">FOO</a>, u'\n']

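So to reach the actual tag you need to access contents[1] (this assumes your contents are exactly the same as above; the index may vary with the source HTML). Once you have found the right index, you can use contents[0] on that tag to get the string inside it.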
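To make the list structure concrete, here is a minimal sketch. It assumes Python 3 and the newer bs4 package (BeautifulSoup 4) with the built-in "html.parser" backend, not the BeautifulSoup 3 used in this thread. The HTML constant is a trimmed copy of the first cell from the question.

```python
from bs4 import BeautifulSoup

# One cell from the question, without trailing spaces
# (an assumption about what the real page looks like).
HTML = '<td class="c">\n<a href="more.asp">FOO</a>\n</td>'

soup = BeautifulSoup(HTML, "html.parser")
td = soup.find("td", class_="c")

# .contents mixes whitespace strings and tags, in document order.
kinds = [type(child).__name__ for child in td.contents]
print(kinds)

# Index 1 happens to be the <a> tag here; its own first child is the text.
text = td.contents[1].contents[0]
print(text)
```

If the whitespace or nesting in the real page differs, index 1 may be a string or may not exist at all, which is consistent with the AttributeError reported in the comments.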

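Because this depends on the exact contents of the HTML source, it is very fragile. A more generic and robust solution is to use find() again to locate the 'a' tag with t.find('a'), and then use its contents list to get the value: t.find('a').contents[0], or just t.find('a').contents to get the whole list.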
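A sketch of the find()-based approach, again assuming Python 3 and bs4 rather than BeautifulSoup 3. The helper name link_texts and the SAMPLE markup are made up for illustration; the first cell deliberately has no whitespace around the link, so positional indexing with contents[1] would fail on it.

```python
from bs4 import BeautifulSoup

def link_texts(html, css_class="c"):
    """Return the text of the first <a> inside each <td> with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for td in soup.find_all("td", class_=css_class):
        a = td.find("a")  # locate the tag by name, not by position
        if a is not None:  # skip cells that contain no link
            texts.append(a.get_text(strip=True))
    return texts

# Hypothetical sample: same cells as the question, with varying whitespace.
SAMPLE = """<td class="c"><a href="more.asp">FOO</a></td>
<td class="c">
  <a href="alotmore.asp">BAR</a>
</td>
<td class="d"><a href="alotmore.asp">BAZZ</a></td>"""

print(link_texts(SAMPLE))
```

Looking the tag up by name keeps the code independent of how many whitespace strings surround it, and the None check avoids an exception on cells without a link.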


2010-10-21 13:14:11

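That doesn't work; this is the error message: AttributeError: 'NavigableString' object has no attribute 'contents' – 2010-10-21 13:17:39

@Joseph: I tested it, and it works with BeautifulSoup 3.0.4 and Python 2.5. It may not work for you if your actual contents list holds something different. I have edited the answer with a more generic solution. – 2010-10-21 13:18:26

The t.find('a').contents[0] part did the trick :) Thank you very much – 2010-10-21 13:26:41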

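For your specific example, pyparsing's makeHTMLTags can be useful: the expressions it builds tolerate many variations in how HTML tags are written, yet give the results a convenient structure: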

html = """ 
<td class="c"> 
<a href="more.asp">FOO</a> 
</td> 
<td class="c"> 
<a href="alotmore.asp">BAR</a> 
</td> 
<td class="d"> 
<a href="alotmore.asp">BAZZ</a> 
</td> 
""" 

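from pyparsing import makeHTMLTags, withAttribute, SkipTo

# Build matching expressions for opening and closing <td> and <a> tags
td, tdEnd = makeHTMLTags("td")
a, aEnd = makeHTMLTags("a")
# Only accept <td> tags whose class attribute is "c"
td.setParseAction(withAttribute(**{"class": "c"}))

# <td class="c"> <a ...> link text </a> </td>
pattern = td + a("anchor") + SkipTo(aEnd)("aBody") + aEnd + tdEnd

# scanString yields (tokens, start, end) for each match
for t, _, _ in pattern.scanString(html):
    print t.aBody, '->', t.anchor.href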

Prints:

FOO -> more.asp 
BAR -> alotmore.asp


2010-10-21 13:27:43 PaulMcG

Thanks for the solution! I haven't used pyparsing yet, but I will definitely check it out for future problems. – 2010-10-21 15:51:11
