2011-03-02 166 views
2

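Python ElementTree XML parsing

I want to parse an XML file that I produced by exporting a PDF to XML 1.0 with Adobe Acrobat Pro. I am using Python and ElementTree for the parsing. The PDF contains a table that spans multiple pages and has several different table headers. I want to parse and extract the row and column data of the table that starts at a header containing a specific string (e.g. "MECHANICAL") and stops at the next table header section (e.g. "COMPLETED"), excluding all row and column data before and after that section. There are no simple tags to key on; the tag pattern just repeats.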

Here is my current Python code:

# Python 

import sys 
import re  # regular expression 
import xml.etree.ElementTree as xml 

tree = xml.parse("C:/Documents and Settings/alilly.CORPORATE/Desktop/python xml parse/excerpt.xml") 

print "=================== Find Columns ===================="  

for node in tree.iter('TR'): 

    print "tag=",node.tag 

    count = len(node.getiterator('TD')) 

    #if count != 10: 
    # continue 

    print "------------" 

    for col in node.getiterator('TD'): 
        print "  tag=",col.tag, "attrib=", col.attrib, "text=", col.text 


print "=================== Find Headers ====================" 

# find headers 
for node in tree.iter('ImageData'): 
    print "figure text = ", node.tail 

Here is my XML file:

<?xml version="1.0" encoding="UTF-8" ?> 
<!-- Created from PDF via Acrobat SaveAsXML --> 
<!-- Mapping Table version: 28-February-2003 --> 
<TaggedPDF-doc> 
<?xpacket begin='?' id='W5M0MpCehiHzreSzNTczkc9d'?> 
<?xpacket begin="?" id="W5M0MpCehiHzreSzNTczkc9d"?> 
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00  "> 
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> 
     <rdf:Description rdf:about="" 
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/"> 
     <pdf:Producer>GPL Ghostscript 8.70</pdf:Producer> 
     <pdf:Keywords/> 
     </rdf:Description> 
     <rdf:Description rdf:about="" 
      xmlns:xmp="http://ns.adobe.com/xap/1.0/"> 
     <xmp:ModifyDate>2011-03-01T09:36:13-05:00</xmp:ModifyDate> 
     <xmp:CreateDate>2011-03-01T09:36:13-05:00</xmp:CreateDate> 
     <xmp:CreatorTool>PDFCreator Version 1.0.2</xmp:CreatorTool> 
     </rdf:Description> 
     <rdf:Description rdf:about="" 
      xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"> 
     <xmpMM:DocumentID>d417764e-466c-11e0-0000-f7ea6a538d79</xmpMM:DocumentID> 
     <xmpMM:InstanceID>uuid:0c6ada50-6db0-4d59-88e1-fc23aa6ebc14</xmpMM:InstanceID> 
     </rdf:Description> 
     <rdf:Description rdf:about="" 
      xmlns:dc="http://purl.org/dc/elements/1.1/"> 
     <dc:format>xml</dc:format> 
     <dc:title> 
      <rdf:Alt> 
       <rdf:li xml:lang="x-default">my pdf file</rdf:li> 
      </rdf:Alt> 
     </dc:title> 
     <dc:creator> 
      <rdf:Seq> 
       <rdf:li>ltamm</rdf:li> 
      </rdf:Seq> 
     </dc:creator> 
     <dc:description> 
      <rdf:Alt> 
       <rdf:li xml:lang="x-default"/> 
       <rdf:li xml:lang="x-repair"/> 
      </rdf:Alt> 
     </dc:description> 
     </rdf:Description> 
    </rdf:RDF> 
</x:xmpmeta> 
<?xpacket end="w"?> 
<?xpacket end='r'?> 
<Part> 
<H1>Misc </H1> 
<Sect> 
<H3>This is a test </H3> 
<Sect> 
<H5>Deletions </H5> 
<L> 
<LI> 
<LI_Title>Special codes </LI_Title> 
</LI> 
</L> 
<Figure> 
<ImageData src=""/> 
</Figure> 
<Figure> 
<ImageData src=""/> 
Main INTERIOR </Figure> 
<Table> 
<TR> 
<TH>S = Standard O = Optional </TH> 
</TR> 
<TR> 
<TD><Figure> 
<ImageData src=""/> 
</Figure> 
</TD> 
<TD>S </TD> 
</TR> 
</Table> 
<Figure> 
<ImageData src=""/> 
This is the MECHANICAL header</Figure> 
<Table> 
<TR> 
<TH>S = Standard O = Optional </TH> 
</TR> 
<TR> 
<TH>Free Flow </TH> 
<TD>Ref. Code </TD> 
<TD>DESCRIPTION </TD> 
<TD>Rooster </TD> 
<TD>747 Dog </TD> 
<TD>888 Rabbit </TD> 
</TR> 
<TR> 
<TD>xxx GOgo xxB </TD> 
<TD>Beany xxx </TD> 
<TD>nothing here xxx </TD> 
<TD>xxx B </TD> 
<TD>snake ddd </TD> 
<TD>Cow fff </TD> 
<TD>eee </TD> 
</TR> 
<TR> 
<TH/> 
<TD/> 
<TD>Squirrel Protection </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
</TR> 
<TR> 
<TH/> 
<TD>J77 </TD> 
<TD>Rocket Launcher </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
</TR> 
<TR> 
<TH/> 
<TD/> 
<TD>Lunch </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
<TD>S </TD> 
</TR> 
<TR> 
<TH/> 
<TD>Jss5 </TD> 
<TD>Now is the time for all good men </TD> 
<TD>-</TD> 
<TD>A1 </TD> 
<TD>A1 </TD> 
<TD>-</TD> 
<TD>-</TD> 
<TD>-</TD> 
<TD>-</TD> 
</TR> 
<TR> 
<TD>Capacity </TD> 
<TD/> 
<TD>2/3 </TD> 
<TD>2/3 </TD> 
<TD>2/3 </TD> 
</TR> 
</Table> 
<Figure> 
<ImageData src=""/> 
Final COMPLETED PAGE 1 OF 2 </Figure> 
<Figure> 
<ImageData src=""/> 
</Figure> 
<P>Graphite </P> 
<P>painted fun </P> 
<P>Control yourself </P> 
<Figure> 
<ImageData src=""/> 
Meaningless Header PAGE 2 OF 2 </Figure> 
<Figure> 
<ImageData src=""/> 
</Figure> 
<P>)multi-coat </P> 
<P>front</P> 
<P>single-slot system </P> 
<Figure> 
<ImageData src=""/> 
Almost Done Header PAGE 1 OF 1 </Figure> 
<Figure> 
<ImageData src=""/> 
</Figure> 
<Figure> 
<ImageData src=""/> 
</Figure> 
<Figure> 
<ImageData src=""/> 
</Figure> 
<P>Snow Blizzard. </P> 
<P>Done </P> 
</Sect> 
</Sect> 
</Part> 
</TaggedPDF-doc> 
+0

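`import xml.etree.ElementTree as xml` is a bad idea; you are clobbering the standard `xml` package namespace. Better to import it as `ET`, or as something else that does not conflict with a known package or module name. – 2011-06-15 19:40:12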
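
A minimal sketch of the import style the comment suggests, written for Python 3. The inline `SAMPLE` string is a made-up stand-in for the exported file, not part of the original question.

```python
import xml.etree.ElementTree as ET  # alias that does not shadow the xml package

# Hypothetical two-cell row standing in for the exported document.
SAMPLE = "<Table><TR><TD>J77 </TD><TD>Rocket Launcher </TD></TR></Table>"

root = ET.fromstring(SAMPLE)
# Collect the stripped text of every TD element in document order.
cells = [td.text.strip() for td in root.iter("TD")]
print(cells)
```

With this alias, `xml.sax` and the other `xml` submodules remain reachable under their usual names in the same script.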

Answers

4

When I need to keep state, I fall back to a SAX-style XML parser. Here is a simple script that just pulls out the rows between your MECHANICAL and COMPLETED figures.

#!python 
import xml.sax 
import xml.sax.handler 

class Handler(xml.sax.handler.ContentHandler): 
    def __init__(self): 
        xml.sax.handler.ContentHandler.__init__(self) 
        self.l_ch = list()    # text chunks seen since the last end tag 
        self.l_rows = list()  # cells of the row currently being collected 
        self.__in_mechanical = False 

    def startElement(self, name, attrs): 
        if name == 'TR': 
            self.l_rows = list() 

    def characters(self, ch): 
        self.l_ch.append(ch) 

    def endElement(self, name): 
        # Rebuild ch on every end tag so an element without text yields '' 
        # instead of reusing a stale (or undefined) value. 
        ch = ''.join(self.l_ch).strip() 

        if name == 'Figure': 
            if ch.find('MECHANICAL') >= 0: 
                self.__in_mechanical = True 
            elif ch.find('COMPLETED') >= 0: 
                self.__in_mechanical = False 

        elif name == 'TD' and self.__in_mechanical: 
            self.l_rows.append(ch) 

        elif name == 'TR' and self.__in_mechanical: 
            print 'Row:', self.l_rows 
            self.l_rows = list() 

        self.l_ch = list() 

parser = xml.sax.make_parser() 
parser.setContentHandler(Handler()) 
parser.parse(open('sample.xml')) 

This gives me the following results, and should get you going toward something more complex.

Row: [] 
Row: [u'Ref. Code', u'DESCRIPTION', u'Rooster', u'747 Dog', u'888 Rabbit'] 
Row: [u'xxx GOgo xxB', u'Beany xxx', u'nothing here xxx', u'xxx B', u'snake ddd', u'Cow fff', u'eee'] 
Row: [u'', u'Squirrel Protection', u'S', u'S', u'S', u'S', u'S', u'S', u'S'] 
Row: [u'J77', u'Rocket Launcher', u'S', u'S', u'S', u'S', u'S', u'S', u'S'] 
Row: [u'', u'Lunch', u'S', u'S', u'S', u'S', u'S', u'S', u'S'] 
Row: [u'Jss5', u'Now is the time for all good men', u'-', u'A1', u'A1', u'-', u'-', u'-', u'-'] 
Row: [u'Capacity', u'', u'2/3', u'2/3', u'2/3'] 
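
The same start/stop flag can also be sketched with ElementTree, which the question already uses, because `iter()` visits elements in document order. This is an illustrative Python 3 sketch, not the answerer's code: `rows_between` and the trimmed `SAMPLE` document are hypothetical names and data introduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the Acrobat export.
SAMPLE = """<Sect>
<Figure><ImageData src=""/>
Main INTERIOR </Figure>
<Table><TR><TD>skip me </TD></TR></Table>
<Figure><ImageData src=""/>
This is the MECHANICAL header</Figure>
<Table>
<TR><TH/><TD>J77 </TD><TD>Rocket Launcher </TD></TR>
<TR><TD>Capacity </TD><TD/><TD>2/3 </TD></TR>
</Table>
<Figure><ImageData src=""/>
Final COMPLETED PAGE 1 OF 2 </Figure>
<Table><TR><TD>skip me too </TD></TR></Table>
</Sect>"""


def rows_between(root, start="MECHANICAL", stop="COMPLETED"):
    """Return the TD text of every TR between the start and stop Figure headers."""
    rows = []
    inside = False
    # iter() walks elements in document order, so one flag is enough state.
    for elem in root.iter():
        if elem.tag == "Figure":
            # The header text is the tail of ImageData; itertext() gathers it.
            header = "".join(elem.itertext())
            if start in header:
                inside = True
            elif stop in header:
                inside = False
        elif elem.tag == "TR" and inside:
            rows.append([(td.text or "").strip() for td in elem.iter("TD")])
    return rows


rows = rows_between(ET.fromstring(SAMPLE))
for row in rows:
    print("Row:", row)
```

Unlike SAX, this loads the whole document into memory first, which is usually acceptable for a file of this size and keeps the code in the library the question started with.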
0

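It is not clear from your description what you are trying to select. It sounds like you want to process all elements between the Figure elements containing the strings "MECHANICAL" and "COMPLETED". (In this example that is just a single table, but I believe it could be any number of tables.)

If you can use lxml, you can select them with XPath.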

from lxml import etree 
x = etree.parse('mech.xml')  # parse() accepts a filename directly 
# select the first Table following "MECHANICAL" 
# (drop the [1] to collect every following Table): 
fol = x.xpath('//Figure[contains(., "MECHANICAL")]/following-sibling::Table[1]') 
# [<Element Table at 101532ec0>] 
# select Tables preceding "COMPLETED" : 
pre = x.xpath('//Figure[contains(.,"COMPLETED")]/preceding-sibling::Table') 
# [<Element Table at 101532d08>, <Element Table at 101532ec0>] 
# get their intersection: 
tables = [ e for e in fol if e in pre ] 
for t in tables: 
    for tr in t.xpath('TR'): 
        pass  # [ ... process ... ]
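
One possible shape for the `[ ... process ... ]` step is to turn each `TR` into a list of cell strings. The sketch below is written for Python 3 and uses the standard-library ElementTree so it runs without lxml; it relies only on child iteration, `findall` and `itertext`, which lxml elements also provide. The helper name `row_cells` and the `TABLE` sample are assumptions introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table standing in for one element of `tables`.
TABLE = """<Table>
<TR><TH>Free Flow </TH><TD>Ref. Code </TD><TD>DESCRIPTION </TD></TR>
<TR><TH/><TD/><TD>Lunch </TD></TR>
<TR><TD><Figure><ImageData src=""/></Figure></TD><TD>S </TD></TR>
</Table>"""


def row_cells(tr):
    """Return the stripped text of each direct TH/TD child of a TR element."""
    cells = []
    for cell in tr:
        if cell.tag in ("TH", "TD"):
            # itertext() also picks up text nested inside child elements.
            cells.append("".join(cell.itertext()).strip())
    return cells


table = ET.fromstring(TABLE)
grid = [row_cells(tr) for tr in table.findall("TR")]
for row in grid:
    print(row)
```

Keeping `TH` cells and empty cells as `''` preserves column positions, which matters here because the exported rows do not all have the same number of cells.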