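lxml: parsing HTML, cannot get nodes

I want to start using lxml to parse HTML. As I understand basic XPath, / should select the root node and //body should select the body element node wherever it sits in the DOM, but I get an empty list for all of them.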

from lxml import html 
import urllib2 
headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0'} 
req = urllib2.Request("http://news.ycombinator.com", None, headers) 
r = urllib2.urlopen(req).read() 
x = html.fromstring(r) 
x.xpath("/") 
[]

Edit:

For example, here is another XPath expression that is valid for that page but returns an empty list:

x.xpath("/html/body/center/table/tbody/tr[3]/td/table/tbody/tr[1]/td[3]") 
[] 
# when it should have returned the following (as of this time) 
# <td class="title"><a href="http://www.tomdalling.com/blog/modern-opengl/opengl-in-2014/">OpenGL in 2014</a><span class="comhead"> (tomdalling.com) </span></td>

Source

2014-09-21 yayu

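Don't you get **urllib2.HTTPError: HTTP Error 403: Forbidden**? – Nabin 2014-09-21 10:29:04

What does **[]** do? – Nabin 2014-09-21 10:29:24

@Nabin Oh, in the actual code I used a proxy and a fake user agent, which I didn't post. '[]' is the output of the last line. I'll make this code workable, just a minute. – yayu 2014-09-21 10:30:55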

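Regarding the second problem you mentioned: the issue with the XPath expression is probably the tbody elements. As you can find in several similar questions on Stack Overflow, for example "Why do browsers insert tbody element into table elements?" and "Why does firebug add <tbody> to <table>?", the short version is that browsers add elements such as head and tbody to the DOM that are not in the source code, so an XPath copied from the browser will not match what lxml parsed. You can omit tbody: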

x.xpath("/html/body/center/table/tr[3]/td/table/tr[1]/td[3]")

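This seems to work, as stated here: Extracting lxml xpath for html table

But I prefer the approach given in the first answer to "Python lxml XPath problem". It should also work if you simply leave out the unnecessary parts of the XPath and shorten the query to the element you are looking for, so instead of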
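The tbody mismatch can be reproduced without fetching the page. The following is a minimal sketch on hypothetical markup (the `SOURCE` string is made up for illustration and is not the Hacker News page): the table is written without tbody, as in raw page source, and the same path is tried with and without the tbody step.

```python
from lxml import html

# Hypothetical markup (an assumption, not the real page): a table written
# without <tbody>, the way it appears in the raw source of many pages.
SOURCE = (
    "<html><body><table>"
    "<tr><td>a</td><td>b</td></tr>"
    "<tr><td>c</td><td>d</td></tr>"
    "</table></body></html>"
)

doc = html.fromstring(SOURCE)

# Path as a browser inspector would report it, including the <tbody>
# that the browser adds to its own DOM.
with_tbody = doc.xpath("/html/body/table/tbody/tr[2]/td[1]")

# Same path with the tbody step removed, matching the source lxml parsed.
without_tbody = doc.xpath("/html/body/table/tr[2]/td[1]")

print(with_tbody)
print([td.text for td in without_tbody])
```

If the source of a page does contain an explicit tbody, the step has to stay in the path, so it is worth checking the raw HTML rather than the browser's inspector view.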

x.xpath("/html/body/center/table/tbody/tr[3]/td/table/tbody/tr[1]/td[3]")

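you should get the result with a shorter query. The skipped levels have to be bridged with the descendant axis (//), because tr is not a direct child of html:

x.xpath("/html//tr[3]//tr[1]/td[3]")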
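To see how a shortened query behaves, here is a small sketch on a hypothetical nested-table layout (the `SOURCE` string and its cell texts are assumptions, only loosely modeled on the page in the question). It compares the full path, a shortened path that uses // to skip the intermediate elements, and a shortened path that uses only single slashes.

```python
from lxml import html

# Hypothetical nested-table layout (an assumption for illustration only).
SOURCE = (
    "<html><body><center><table>"
    "<tr><td>header</td></tr>"
    "<tr><td>spacer</td></tr>"
    "<tr><td><table>"
    "<tr><td>1.</td><td>vote</td><td class='title'>First story</td></tr>"
    "<tr><td>2.</td><td>vote</td><td class='title'>Second story</td></tr>"
    "</table></td></tr>"
    "</table></center></body></html>"
)

doc = html.fromstring(SOURCE)

# Full path, written without tbody because the source has none.
full = doc.xpath("/html/body/center/table/tr[3]/td/table/tr[1]/td[3]")

# Shortened path: '//' (descendant axis) skips the intermediate elements.
short = doc.xpath("/html//tr[3]//tr[1]/td[3]")

# Shortened with single slashes only: asks for a tr that is a direct
# child of html, which this document does not contain.
direct = doc.xpath("/html/tr[3]/tr[1]/td[3]")

print([td.text for td in full])
print([td.text for td in short])
print(direct)
```

A shortened query is less sensitive to layout changes, but it can also match more nodes than the full path on a page with several nested tables, so the result list should be checked.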

Source

2014-09-21 20:10:16
