2016-12-03 81 views
-1

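Extracting the content of an HTML page element using Python

I have been given a link to an HTML page. How do I open it and get the content of a specific element using its absolute XPath?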

from lxml import html 
import requests 
page = requests.get('http://www.professorpaddle.com/rivers/riverlist.asp') 
tree = html.fromstring(page.content) 
table_data=[] 
temp_dict={} 
temp = tree.xpath('//a[@class="pathm"]') 
for i in temp: 
    link=i.attrib.get('href') 
    link="http://www.professorpaddle.com/rivers/"+link 
    temp_dict['name']=i.text 
    temp_dict['link']=link 
    print(link) 
    temp_page=requests.get(link) 
    temp_tree=html.fromstring(temp_page.content) 
    x=temp_tree.xpath('/html/body/element/table/tbody/tr[2]/td/table/tbody/tr/td/table[1]/tbody/tr[2]/td[3]/table/tbody/tr[3]/td[2]/font') 
    print(x) 
    break 
+3

Have you tried anything? – Dekel

+0

Yes, but how do I post my code? – FibonacciCoder

+0

Check this: http://stackoverflow.com/editing-help – Dekel

Answers

1

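XPath does not seem to be able to find the tbody elements. That is probably because browsers insert tbody into the tables they display, while the raw HTML that lxml parses does not contain it. I also tried simplifying the XPath search strings to make things easier for myself. In doing so I soon noticed that one of the classes is spelled two ways (tableBorder and tableborder). Here is what I have for one page.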

>>> URL = 'http://www.professorpaddle.com/rivers/riverdetails.asp?riverid=350' 
>>> from lxml import html 
>>> import requests 
>>> page = requests.get(URL) 
>>> tree = html.fromstring(page.content) 
>>> tableRows = tree.xpath('..//table[@class="tableBorder" or @class="tableborder"][2]/tr') 
>>> len(tableRows) 
2 
>>> for row in tableRows: 
...  for child in row.iterchildren(): 
...   if child.text: 
...    child.text.strip() 
...    
'Pinned Forum Threads' 
'' 
'' 
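
The tbody point can be checked without touching the network. The following is a minimal sketch, assuming a small hand-written HTML string (SAMPLE is hypothetical, not the real page) whose table has no tbody, as is common in served markup. It compares an XPath that includes a tbody step, as a browser inspector would show it, with the same path without that step.

```python
from lxml import html

# Hypothetical markup standing in for the real page: there is no <tbody> here.
SAMPLE = """
<html><body>
  <table class="tableBorder">
    <tr><td>Put In Longitude : </td><td>-121.29268</td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Path as a browser inspector would report it, including the implied <tbody>.
with_tbody = tree.xpath('//table/tbody/tr/td')

# Same path with the tbody step removed, matching the markup as written.
without_tbody = tree.xpath('//table/tr/td')

print(len(with_tbody))        # number of cells found via the tbody path
print(len(without_tbody))     # number of cells found without it
print(without_tbody[1].text)  # text of the second cell
```

If the tbody path comes back empty while the shorter one finds the cells, dropping every tbody step from the absolute XPath in the question is a reasonable first thing to try.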

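I almost forgot: I would have preferred to use matches, but apparently regular expressions are not available in this implementation of XPath.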
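
lxml's XPath is version 1.0, so it has no matches() function, but lxml does expose the EXSLT regular-expression functions when their namespace is passed in. The sketch below is one possible way to match both spellings of the class with a single case-insensitive test; the SAMPLE markup and the names EXSLT_RE and cells are illustrative assumptions, not taken from the real page.

```python
from lxml import html

# Namespace prefix for the EXSLT regular-expression functions supported by lxml.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical markup with the two class spellings plus an unrelated table.
SAMPLE = """
<html><body>
  <table class="tableBorder"><tr><td>first</td></tr></table>
  <table class="tableborder"><tr><td>second</td></tr></table>
  <table class="other"><tr><td>third</td></tr></table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# re:test(string, pattern, flags); the "i" flag makes the match case-insensitive.
tables = tree.xpath(
    '//table[re:test(@class, "^tableborder$", "i")]',
    namespaces=EXSLT_RE,
)

# Collect the text of the first cell of each matching table.
cells = [table.xpath('string(.//td)') for table in tables]
print(cells)
```

The `or` of two class comparisons used in this answer works just as well for two known spellings; the regular-expression form mainly helps when the set of variants is not known in advance.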

Addition, in response to the comments:

>>> fontItems = tree.xpath('..//table[@class="tableBorder" or @class="tableborder"][1]/tr/td/font[@class="path"]') 
>>> len(fontItems) 
12 
>>> for item in fontItems: 
...  list(item.itertext()) 
...  
['GPS/GIS'] 
['Maps'] 
['Put In Longitude : '] 
['-121.29268'] 
['Put In Latitude : '] 
['47.8034515'] 
['Take Out Longitude : '] 
['-121.33998'] 
['Take Out Latitude : '] 
['47.7137985'] 
['County : '] 
['Snohomish'] 
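
The output above alternates between labels ending in a colon and their values, so it can be folded into a dictionary. This is a rough sketch, assuming every label is immediately followed by its value; pair_labels is a hypothetical helper, and the texts list simply repeats the strings printed above.

```python
def pair_labels(texts):
    """Pair each 'Label : ' string with the string that follows it."""
    cleaned = [text.strip() for text in texts]
    pairs = {}
    # Walk over adjacent items; keep only those whose first item is a label.
    for label, value in zip(cleaned, cleaned[1:]):
        if label.endswith(':'):
            pairs[label.rstrip(':').strip()] = value
    return pairs


# Strings copied from the output above.
texts = [
    'GPS/GIS', 'Maps',
    'Put In Longitude : ', '-121.29268',
    'Put In Latitude : ', '47.8034515',
    'Take Out Longitude : ', '-121.33998',
    'Take Out Latitude : ', '47.7137985',
    'County : ', 'Snohomish',
]

details = pair_labels(texts)
print(details['County'])
```

With the real page, texts could be built as `[''.join(item.itertext()) for item in fontItems]`. If a label on the page has no value, the next label would be taken as its value, so the result is worth checking by eye.
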
+0

http://stackoverflow.com/questions/40949270/extracting-just-sibling-element-in-xpath – FibonacciCoder

+0

Please answer this question – FibonacciCoder

+0

Please see the edit.