2011-01-09 101 views

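Parsing data in XML and storing it in a database in Python

Hi everyone, I have a problem parsing an XML file and inserting the data into SQLite. The format is shown below; I need to store the text inside the tokens, such as 111, AAA, BBB, and so on.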

<DOCUMENT> 
<PAGE width="544.252" height="634.961" number="1" id="p1"> 
<MEDIABOX x1="0" y1="0" x2="544.252" y2="634.961"/> 

<BLOCK id="p1_b1"> 

<TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652"> 
<TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN> 
</TEXT> 
</BLOCK> 

<BLOCK id="p1_b3"> 

<TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096"> 
<TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes"  italic="yes">AAA</TOKEN> 
<TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN> 
<TOKEN sid="p1_s37" id="p1_w24" font-name="verdanae" bold="yes" italic="no">CCC</TOKEN> 
</TEXT> 
</BLOCK> 

<BLOCK id="p1_b4"> 

<TEXT width="82.72" height="26" id="p1_t7" x="55.426" y="138.026"> 
<TOKEN sid="p1_s42" id="p1_w29" font-name="verdanae" bold="yes" italic="no">DDD</TOKEN> 
<TOKEN sid="p1_s43" id="p1_w30" font-name="verdanae" bold="yes" italic="no">EEE</TOKEN> 
</TEXT> 

<TEXT width="101.74" height="26" id="p1_t8" x="55.406" y="162.026"> 
<TOKEN sid="p1_s45" id="p1_w31" font-name="verdanae" bold="yes" italic="no">FFF</TOKEN> 
</TEXT> 

<TEXT width="152.96" height="26" id="p1_t9" x="55.406" y="186.026"> 
<TOKEN sid="p1_s47" id="p1_w32" font-name="verdanae" bold="yes" italic="no">GGG</TOKEN> 
<TOKEN sid="p1_s48" id="p1_w33" font-name="verdanae" bold="yes" italic="no">HHH</TOKEN> 
</TEXT> 
</BLOCK> 
</PAGE> 
</DOCUMENT> 
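In .NET it is done with three nested foreach loops: 1. "DOCUMENT/PAGE/BLOCK", 2. "TEXT", 3. "TOKEN", and the values are then inserted into the DB. I don't know how to do this in Python; I am trying it with the lxml module.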


Do you mean you need to get all the token values? Like ['111','BBB','EEE'] or [['111'],['BBB','EEE']] – virhilo 2011-01-09 10:40:23

Answers

1

Do you mean this?

>>> xml = """<DOCUMENT> 
... <PAGE width="544.252" height="634.961" number="1" id="p1"> 
... <MEDIABOX x1="0" y1="0" x2="544.252" y2="634.961"/> 
... 
... <BLOCK id="p1_b1"> 
... 
... <TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652"> 
... <TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN> 
... </TEXT> 
... </BLOCK> 
... 
... <BLOCK id="p1_b3"> 
... 
... <TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096"> 
... <TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes"  italic="yes">AAA</TOKEN> 
... <TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN> 
... <TOKEN sid="p1_s37" id="p1_w24" font-name="verdanae" bold="yes" italic="no">CCC</TOKEN> 
... </TEXT> 
... </BLOCK> 
... 
... <BLOCK id="p1_b4"> 
... 
... <TEXT width="82.72" height="26" id="p1_t7" x="55.426" y="138.026"> 
... <TOKEN sid="p1_s42" id="p1_w29" font-name="verdanae" bold="yes" italic="no">DDD</TOKEN> 
... <TOKEN sid="p1_s43" id="p1_w30" font-name="verdanae" bold="yes" italic="no">EEE</TOKEN> 
... </TEXT> 
... 
... <TEXT width="101.74" height="26" id="p1_t8" x="55.406" y="162.026"> 
... <TOKEN sid="p1_s45" id="p1_w31" font-name="verdanae" bold="yes" italic="no">FFF</TOKEN> 
... </TEXT> 
... 
... <TEXT width="152.96" height="26" id="p1_t9" x="55.406" y="186.026"> 
... <TOKEN sid="p1_s47" id="p1_w32" font-name="verdanae" bold="yes" italic="no">GGG</TOKEN> 
... <TOKEN sid="p1_s48" id="p1_w33" font-name="verdanae" bold="yes" italic="no">HHH</TOKEN> 
... </TEXT> 
... </BLOCK> 
... </PAGE> 
... </DOCUMENT>""" 
>>> from lxml import etree 
>>> parsed = etree.fromstring(xml) 
>>> tokens = parsed.xpath('//TOKEN/text()') 
>>> tokens 
['111', 'AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG', 'HHH'] 
>>> 

Or this?

>>> parsed = etree.fromstring(xml) 
>>> for block in parsed.xpath('//PAGE/BLOCK/TEXT'): 
...     print(block.xpath('./TOKEN/text()'))
... 
['111'] 
['AAA', 'BBB', 'CCC'] 
['DDD', 'EEE'] 
['FFF'] 
['GGG', 'HHH'] 
>>> 
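
To get from either listing to the SQLite table the question asks about, here is a minimal sketch. It makes several assumptions that are not in the question: a hypothetical table named tokens with columns block_id, text_id and token (no schema was given), an in-memory database, and a shortened copy of the XML. It also swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra packages; the same findall calls should work on lxml elements too.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question.
XML = """<DOCUMENT>
<PAGE width="544.252" height="634.961" number="1" id="p1">
<BLOCK id="p1_b1">
<TEXT id="p1_t1"><TOKEN id="p1_w1">111</TOKEN></TEXT>
</BLOCK>
<BLOCK id="p1_b3">
<TEXT id="p1_t6"><TOKEN id="p1_w22">AAA</TOKEN><TOKEN id="p1_w23">BBB</TOKEN></TEXT>
</BLOCK>
</PAGE>
</DOCUMENT>"""


def store_tokens(xml_text, conn):
    """Walk PAGE/BLOCK, TEXT, TOKEN (like the three .NET loops) and insert one row per token."""
    # Hypothetical schema: the question does not say what the table looks like.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens (block_id TEXT, text_id TEXT, token TEXT)"
    )
    root = ET.fromstring(xml_text)  # root is the DOCUMENT element
    rows = []
    for block in root.findall("./PAGE/BLOCK"):
        for text in block.findall("./TEXT"):
            for token in text.findall("./TOKEN"):
                rows.append((block.get("id"), text.get("id"), token.text))
    # Parameterized insert; executemany sends all rows in one call.
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = store_tokens(XML, conn)
stored = conn.execute("SELECT block_id, token FROM tokens ORDER BY rowid").fetchall()
print(inserted, stored)
```

For a real file, sqlite3.connect would take a file path instead of ":memory:", and ET.parse(path).getroot() would replace ET.fromstring.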

I tried it the same way, but I got an empty list; I had not added the "." before "/TOKEN/text()". Why did you add the dot, and what does it do? Anyway, thanks a lot, dude. – Rakesh 2011-01-09 14:15:48
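
On the question about the dot: in XPath, a path that starts with "/" is absolute and is evaluated from the root of the document, even when xpath() is called on an inner element, whereas "./" starts at the element the call is made on. A short sketch with a cut-down version of the XML (the variable names are only illustrative):

```python
from lxml import etree

xml = """<DOCUMENT>
<PAGE id="p1">
<BLOCK id="p1_b1"><TEXT id="p1_t1"><TOKEN>111</TOKEN></TEXT></BLOCK>
<BLOCK id="p1_b3"><TEXT id="p1_t6"><TOKEN>AAA</TOKEN><TOKEN>BBB</TOKEN></TEXT></BLOCK>
</PAGE>
</DOCUMENT>"""

parsed = etree.fromstring(xml)
first_text = parsed.xpath('//PAGE/BLOCK/TEXT')[0]

# Absolute path: asks for a root element named TOKEN.
# The root here is DOCUMENT, so nothing matches.
absolute = first_text.xpath('/TOKEN/text()')

# '//' is also anchored at the document root, so it finds every TOKEN
# in the document, not only those under first_text.
anywhere = first_text.xpath('//TOKEN/text()')

# './' is relative to first_text, so only its own TOKEN children are returned.
relative = first_text.xpath('./TOKEN/text()')

print(absolute, anywhere, relative)
```

Writing 'TOKEN/text()' with no leading slash at all is also relative and should give the same result as './TOKEN/text()'.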