2012-07-31 119 views
2

Parsing HTML in Python with lxml and XPath

I want to use Python with lxml and XPath to parse values out of an HTML form.

Here is my HTML data:

<table> 
<tr> 
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td> 
     <td class="u"> 
     <select name="record[13][type]"> 
     <option SELECTED value="A" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td> 

<td class="u"><input class="wide" name="record[14][name]" value="exampledomain2.com"></td> 
     <td class="u"> 
     <select name="record[14][type]"> 
     <option SELECTED value="CNAME" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[14][content]" value='exampledomain1.com'></td> 

<td class="u"><input class="wide" name="record[15][name]" value="exampledomain3.com"></td> 
     <td class="u"> 
     <select name="record[15][type]"> 
     <option SELECTED value="A" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[15][content]" value='10.10.10.3'></td> 
</tr> 
</table> 

What I want is to parse the values and print them like this:

exampledomain1.com A 10.10.10.1 
exampledomain2.com CNAME exampledomain1.com 
exampledomain3.com A 10.10.10.3 

Here is what I tried:

#!/usr/bin/python 
import lxml.html 
from lxml import etree 

doc = lxml.html.document_fromstring("""Here whole html data""") 
txt1 = doc.xpath('//*[@class="wide"]/@value') 
txt2 = doc.xpath('//@SELECTED/text()') 
print txt1 
print txt2 

But it does not work the way I want. Any help would be appreciated.

Thanks, everyone.

+4

Running "xmllint --noout" on your HTML reports 7 errors. You should fix them before parsing it. – 2012-07-31 16:33:17

+0

In what way does it "not work the way you want"? – 2012-07-31 17:11:49

+1

Use BeautifulSoup.. it's simple and easy – Surya 2012-08-01 14:55:47

Answers

3

I fixed your code so that it returns the following, which is very close to what you asked for:

(py26_default)[[email protected] ~]$ python parse.py 
exampledomain1.com 10.10.10.1 
exampledomain2.com exampledomain1.com 
exampledomain3.com 10.10.10.3 
(py26_default)[[email protected] ~]$ 

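I could not retrieve record[13][type] using XPath... there are other ways to iterate through this, but I am leaving that as an exercise for the OP. Note that I fixed the HTML from the OP's question to include the <table><tr> tags...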

import lxml.html 
from lxml import etree 
from lxml.etree import XMLParser 

parser = XMLParser(ns_clean=True, recover=True) 
doc = etree.fromstring("""Here whole html data""", parser) 
elem1 = doc.xpath('//input[@name="record[13][name]"]') 
# NOTE: <option SELECTED> cannot be retrieved with xpath... SELECTED must have 
# a value to do so... 
#elem2 = doc.xpath('//select[@name="record[13][type]"]/option[@SELECTED]') 
elem3 = doc.xpath('//input[@name="record[13][content]"]') 

for idx, val in enumerate(elem1): 
    print val.attrib['value'], elem3[idx].attrib['value'] 

<!-- The (fixed) html source I used --> 
<table> 
<tr> 
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td> 
     <td class="u"> 
     <select name="record[13][type]"> 
     <option SELECTED value="A" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td> 

<td class="u"><input class="wide" name="record[13][name]" value="exampledomain2.com"></td> 
     <td class="u"> 
     <select name="record[13][type]"> 
     <option SELECTED value="CNAME" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[13][content]" value='exampledomain1.com'></td> 

<td class="u"><input class="wide" name="record[13][name]" value="exampledomain3.com"></td> 
     <td class="u"> 
     <select name="record[13][type]"> 
     <option SELECTED value="A" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.3'></td> 
</tr> 
</table> 
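
For what it is worth, the valueless SELECTED attribute appears to be reachable when the markup goes through lxml's HTML parser instead of XMLParser. That parser lowercases attribute names, so the predicate has to be written as @selected. A minimal sketch, assuming a trimmed-down copy of the form (the html string and the variable names are made up for illustration):

```python
import lxml.html

# Trimmed-down stand-in for the form above (assumption: one record and a
# shortened option list are enough to show the idea).
html = """
<table><tr>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td class="u">
<select name="record[13][type]">
<option SELECTED value="A" >A</option>
<option value="AAAA" >AAAA</option>
<option value="CNAME" >CNAME</option>
</select>
</td>
<td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
</tr></table>
"""

doc = lxml.html.fromstring(html)

# The HTML parser lowercases attribute names, so SELECTED is matched as @selected.
selected = doc.xpath('//select[@name="record[13][type]"]/option[@selected]/@value')
print(selected)
```

The question's own attempt, '//@SELECTED/text()', would not match for two reasons under this parser: the attribute name is lowercased, and an attribute node has no text() children.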
+0

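Hi Mike, the "record[13]" part of the name field changes for each of the other DNS records; I have corrected that in the HTML in the question. So in this case '//input[@name="record[13][name]"]' will not capture all the records with their different numbers. Can I define a wildcard or a range there? – Manish 2012-08-01 15:01:42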

+0

You can use [lxml regular expressions](http://stackoverflow.com/a/2756994/667301) to solve this – 2012-08-01 15:26:26

+0

Thanks Mike, I got it working with regular expressions, but I am still stuck on getting the SELECTED value – Manish 2012-08-02 16:13:49
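
The regular-expression approach mentioned in these comments can be sketched roughly as follows. This is not the code Manish ended up with (that was never posted); it assumes lxml's HTML parser, the EXSLT "re" namespace that lxml exposes in XPath, and a shortened stand-in for the form. The names EXSLT_RE, html and rows are illustrative only.

```python
import lxml.html

# EXSLT regular-expression namespace, usable from XPath in lxml.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Shortened stand-in for the form (assumption: option lists trimmed).
html = """
<table><tr>
<td><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td><select name="record[13][type]">
<option SELECTED value="A" >A</option>
<option value="CNAME" >CNAME</option>
</select></td>
<td><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
<td><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
<td><select name="record[14][type]">
<option value="A" >A</option>
<option SELECTED value="CNAME" >CNAME</option>
</select></td>
<td><input class="wide" name="record[14][content]" value='exampledomain1.com'></td>
<td><input class="wide" name="record[15][name]" value="exampledomain3.com"></td>
<td><select name="record[15][type]">
<option SELECTED value="A" >A</option>
<option value="CNAME" >CNAME</option>
</select></td>
<td><input class="wide" name="record[15][content]" value='10.10.10.3'></td>
</tr></table>
"""

doc = lxml.html.fromstring(html)

# \d+ stands in for the record number, so one query covers every record.
names = doc.xpath(
    r'//input[re:test(@name, "^record\[\d+\]\[name\]$")]/@value',
    namespaces=EXSLT_RE)
# The HTML parser lowercases SELECTED, hence option[@selected].
types = doc.xpath(
    r'//select[re:test(@name, "^record\[\d+\]\[type\]$")]/option[@selected]/@value',
    namespaces=EXSLT_RE)
contents = doc.xpath(
    r'//input[re:test(@name, "^record\[\d+\]\[content\]$")]/@value',
    namespaces=EXSLT_RE)

# zip() pairs results by document order; this assumes every record has
# exactly one name, one selected type and one content field.
rows = ["%s %s %s" % fields for fields in zip(names, types, contents)]
for row in rows:
    print(row)
```

Pairing by position keeps the sketch short, but it would silently misalign if a record had no selected option, so a per-record lookup may be safer on real data.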

0
# Assumes tree = lxml.html.fromstring(html); that parser lowercases SELECTED to selected.
# The name and content fields are <input> elements, so read their value attribute.
record_13_name = tree.xpath("//input[@name='record[13][name]']/@value")
record_13_type = tree.xpath("//select[@name='record[13][type]']/option[@selected]/@value")
record_13_content = tree.xpath("//input[@name='record[13][content]']/@value")


record_14_name = tree.xpath("//input[@name='record[14][name]']/@value")
record_14_type = tree.xpath("//select[@name='record[14][type]']/option[@selected]/@value")
record_14_content = tree.xpath("//input[@name='record[14][content]']/@value")


record_15_name = tree.xpath("//input[@name='record[15][name]']/@value")
record_15_type = tree.xpath("//select[@name='record[15][type]']/option[@selected]/@value")
record_15_content = tree.xpath("//input[@name='record[15][content]']/@value")
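

The per-record lookups above repeat the same three queries, so they could be folded into a small helper. A sketch under a few assumptions: the tree comes from lxml.html.fromstring (so the SELECTED attribute is matched as @selected), the values are read from the value attribute, and the record numbers 13 to 15 are known in advance. The helper name record_fields and the shortened HTML are hypothetical.

```python
import lxml.html

def record_fields(tree, num):
    """Return (name, type, content) for one record number (hypothetical helper)."""
    name = tree.xpath("//input[@name='record[%d][name]']/@value" % num)
    rtype = tree.xpath(
        "//select[@name='record[%d][type]']/option[@selected]/@value" % num)
    content = tree.xpath("//input[@name='record[%d][content]']/@value" % num)
    # Each query returns a list; this sketch assumes exactly one match per field.
    return name[0], rtype[0], content[0]

# Shortened stand-in for the form (assumption: option lists trimmed).
html = """
<table><tr>
<td><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td><select name="record[13][type]">
<option SELECTED value="A" >A</option>
<option value="CNAME" >CNAME</option>
</select></td>
<td><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
<td><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
<td><select name="record[14][type]">
<option value="A" >A</option>
<option SELECTED value="CNAME" >CNAME</option>
</select></td>
<td><input class="wide" name="record[14][content]" value='exampledomain1.com'></td>
<td><input class="wide" name="record[15][name]" value="exampledomain3.com"></td>
<td><select name="record[15][type]">
<option SELECTED value="A" >A</option>
<option value="CNAME" >CNAME</option>
</select></td>
<td><input class="wide" name="record[15][content]" value='10.10.10.3'></td>
</tr></table>
"""

tree = lxml.html.fromstring(html)

# Record numbers are hard-coded here; on a real page they would need to be discovered.
records = [record_fields(tree, num) for num in range(13, 16)]
for name, rtype, content in records:
    print("%s %s %s" % (name, rtype, content))
```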