How do I get an HTML element with Python lxml?

I have this HTML code:

<table> 
<tr> 
    <td class="test"><b><a href="">aaa</a></b></td> 
    <td class="test">bbb</td> 
    <td class="test">ccc</td> 
    <td class="test"><small>ddd</small></td> 
</tr> 
<tr> 
    <td class="test"><b><a href="">eee</a></b></td> 
    <td class="test">fff</td> 
    <td class="test">ggg</td> 
    <td class="test"><small>hhh</small></td> 
</tr> 
</table>

I use this Python code with the lxml module to extract the <td class="test"> cells.

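import urllib2
import lxml.html

# Fetch the page and parse it into an element tree (Python 2).
code = urllib2.urlopen("http://www.example.com/page.html").read()
html = lxml.html.fromstring(code)
# Keep the first and fourth <td class="test"> of each row.
result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')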

It works well! The result is:

<td class="test"><b><a href="">aaa</a></b></td> 
<td class="test"><small>ddd</small></td> 


<td class="test"><b><a href="">eee</a></b></td> 
<td class="test"><small>hhh</small></td>
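
For anyone who wants to reproduce this without fetching a page, here is a minimal sketch in Python 3 that parses the same table from an inline string. The `PAGE` constant and the variable names are assumptions made for illustration; the XPath expression is the one from the question.

```python
import lxml.html

# Inline copy of the table from the question (assumption: used here
# instead of downloading http://www.example.com/page.html).
PAGE = """
<table>
<tr>
    <td class="test"><b><a href="">aaa</a></b></td>
    <td class="test">bbb</td>
    <td class="test">ccc</td>
    <td class="test"><small>ddd</small></td>
</tr>
<tr>
    <td class="test"><b><a href="">eee</a></b></td>
    <td class="test">fff</td>
    <td class="test">ggg</td>
    <td class="test"><small>hhh</small></td>
</tr>
</table>
"""

doc = lxml.html.fromstring(PAGE)

# position() is evaluated per parent <tr>, so this keeps the first and
# fourth matching cell of every row.
cells = doc.xpath('//td[@class="test"][position() = 1 or position() = 4]')

# Serialize each cell back to markup, dropping the whitespace that follows it.
serialized = [
    lxml.html.tostring(cell, encoding="unicode", with_tail=False)
    for cell in cells
]
for markup in serialized:
    print(markup)
```

The query returns four `<td>` elements, two per row, in document order.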

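So I get the first and fourth cells of each <tr>. Now I want to extract:

aaa (the text of the link)

ddd (the text between the <small> tags)

eee (the text of the link)

hhh (the text between the <small> tags)

How can I extract these values?

(The problem is that in the first column I have to strip the <b> tag and get the anchor's text, and in the fourth column I have to strip the <small> tag.)

Thanks!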

Source

2010-05-10 Damiano

Why not just get what you want directly at each step?

links = [el.text for el in html.xpath('//td[@class="test"][position() = 1]/b/a')] 
smalls = [el.text for el in html.xpath('//td[@class="test"][position() = 4]/small')] 
print zip(links, smalls) 
# => [('aaa', 'ddd'), ('eee', 'hhh')]
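
The same idea as a self-contained Python 3 sketch, since `print` is a function there and `zip` returns an iterator. The inline `PAGE` string and the names `doc` and `pairs` are assumptions for illustration; the XPath expressions are the ones used above.

```python
import lxml.html

# Compact inline copy of the question's table (assumption: stands in for
# the downloaded page).
PAGE = (
    '<table>'
    '<tr><td class="test"><b><a href="">aaa</a></b></td>'
    '<td class="test">bbb</td><td class="test">ccc</td>'
    '<td class="test"><small>ddd</small></td></tr>'
    '<tr><td class="test"><b><a href="">eee</a></b></td>'
    '<td class="test">fff</td><td class="test">ggg</td>'
    '<td class="test"><small>hhh</small></td></tr>'
    '</table>'
)

doc = lxml.html.fromstring(PAGE)

# Step straight down to the element that holds the text in each column.
links = [a.text for a in doc.xpath('//td[@class="test"][position() = 1]/b/a')]
smalls = [s.text for s in doc.xpath('//td[@class="test"][position() = 4]/small')]

# Pair the link text and the <small> text row by row.
pairs = list(zip(links, smalls))
print(pairs)
```

This relies on every row having both a link in the first cell and a `<small>` in the fourth; if a row lacks one of them, the two lists would drift out of step.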

Source

2010-05-11 01:20:10

If you call el.text_content(), you get the text of each element with all the tags stripped, i.e.:

result = [el.text_content() for el in result]
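
A short Python 3 sketch of the difference between `.text` and `.text_content()` on the cells from the question. The inline `PAGE` string and the variable names are assumptions for illustration.

```python
import lxml.html

# Compact inline copy of the question's table (assumption: stands in for
# the downloaded page).
PAGE = (
    '<table>'
    '<tr><td class="test"><b><a href="">aaa</a></b></td>'
    '<td class="test">bbb</td><td class="test">ccc</td>'
    '<td class="test"><small>ddd</small></td></tr>'
    '<tr><td class="test"><b><a href="">eee</a></b></td>'
    '<td class="test">fff</td><td class="test">ggg</td>'
    '<td class="test"><small>hhh</small></td></tr>'
    '</table>'
)

doc = lxml.html.fromstring(PAGE)
cells = doc.xpath('//td[@class="test"][position() = 1 or position() = 4]')

# .text only covers text directly inside <td>, before its first child,
# so it is None here because the text lives inside <b><a> or <small>.
direct_text = [cell.text for cell in cells]

# .text_content() concatenates the text of the element and all descendants.
values = [cell.text_content() for cell in cells]

print(direct_text)
print(values)
```

Because `text_content()` ignores the nesting, the same call handles both the `<b><a>` cells and the `<small>` cells.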

Source

2010-05-11 02:13:07
