2017-06-19 83 views
0

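Converting XML to lists of tags and values using Python

I am learning Python and am trying to extract a list of all tags and their corresponding values from any XML file. Here is my code so far.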

from lxml import etree


def ParseXml(XmlFile):
    try:
        # remove_blank_text discards whitespace-only text between elements
        parser = etree.XMLParser(remove_blank_text=True, compact=True)
        tree = etree.parse(XmlFile, parser)
        root = tree.getroot()

        ListOfTags, ListOfValues, ListOfAttribs = [], [], []
        for elem in root.iter('*'):
            Tag = elem.tag
            ListOfTags.append(Tag)

            # Use an empty string when the element has no text
            value = elem.text
            if value is not None:
                ListOfValues.append(value)
            else:
                ListOfValues.append('')

            # Wrap the attribute mapping in a list, or use an empty list if there are no attributes
            attrib = elem.attrib
            if attrib:
                ListOfAttribs.append([attrib])
            else:
                ListOfAttribs.append([])
        print('%s File parsed successfully' % XmlFile)
        return (ListOfTags, ListOfValues, ListOfAttribs)

    except Exception as e:
        print('Error while parsing XMLs : %s : %s' % (type(e), e))
        return ([], [], [])

For an XML input like this:

<?xml version="1.0" encoding="UTF-8"?> 
<Application Version="2.01"> 
    <UserAuthRequest> 
     <VendorApp> 
      <AppName>SING</AppName> 
     </VendorApp> 
    </UserAuthRequest> 
    <ApplicationRequest ID="12-123-AH"> 
     <GUID>ABD45129-PD1212-121DFL</GUID> 
     <Type tc="200">Streaming</Type> 
     <File></File> 
     <FileExtension VendorCode="200"> 
      <Result> 
       <ResultCode tc="1">Success</ResultCode> 
      </Result> 
     </FileExtension> 
    </ApplicationRequest> 
</Application> 

This outputs separate lists of tags, values, and attributes. This works fine.

['Application', 'UserAuthRequest', 'VendorApp', 'AppName', 'ApplicationRequest', 'GUID', 'Type', 'File', 'FileExtension', 'Result', 'ResultCode'] 
['', '', '', 'SING', '', 'ABD45129-PD1212-121DFL', 'Streaming', '', '', '', 'Success'] 
[[{'Version': '2.01'}], [], [], [], [{'ID': '12-123-AH'}], [], [{'tc': '200'}], [], [{'VendorCode': '200'}], [], [{'tc': '1'}]] 

My problem is that I need each tag to include its parent tags as well. The output I am actually aiming for looks like this:

['Application', 'UserAuthRequest', 'UserAuthRequest.VendorApp', 'UserAuthRequest.VendorApp.AppName', 'ApplicationRequest', 'ApplicationRequest.GUID', 'ApplicationRequest.Type', 'ApplicationRequest.File', 'ApplicationRequest.File.FileExtension', 'ApplicationRequest.File.FileExtension.Result', 'ApplicationRequest.File.FileExtension.Result.ResultCode'] 

How can I do this in Python? Or is there an alternative way to do it?

+1

Have you tried using BeautifulSoup? – snapcrack

+0

I read somewhere that it is similar to lxml. Is it possible to get the desired output using BeautifulSoup? If so, how? – Naveen

+0

The target output seems inconsistent, at least for the children of the root node; they should be 'Application.UserAuthRequest' and 'Application.ApplicationRequest'. Also, there is no 'ApplicationRequest.File.*' in the _xml_. – CristiFati

Answers

0

Here is a recursive approach that uses only [Python]: xml.etree.ElementTree (The ElementTree XML API):

import xml.etree.ElementTree as ET 


def parse_node(node, ancestor_string=""):
    #print(type(node), dir(node))

    # Build the dotted path of this node from the path of its ancestors
    if ancestor_string:
        node_string = ".".join([ancestor_string, node.tag])
    else:
        node_string = node.tag
    tag_list = [node_string]
    text = node.text
    if text:
        text_list = [text.strip()]
    else:
        text_list = [""]
    attr_list = [node.attrib]
    # Recurse into the direct children, passing the current path down
    for child_node in list(node):
        child_tag_list, child_text_list, child_attr_list = parse_node(child_node, ancestor_string=node_string)
        tag_list.extend(child_tag_list)
        text_list.extend(child_text_list)
        attr_list.extend(child_attr_list)
    return tag_list, text_list, attr_list


def parse_xml(file_name):
    # Parse the file that was passed in rather than a hard-coded name
    tree = ET.parse(file_name)
    root_node = tree.getroot()
    tags, texts, attrs = parse_node(root_node)
    print(tags)
    print(texts)
    print(attrs)


def main(): 
    parse_xml("a.xml") 


if __name__ == "__main__": 
    main() 

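Notes

  • The idea is to "remember the path" in the XML tree. This is done through the ancestor_string argument of parse_node, which is computed for each node in the tree and passed to its (direct) children
  • The naming differs from the one in the question, because it takes [Python]: PEP 8 -- Style Guide for Python Code into account
  • At first glance, having two functions (main and parse_xml), one of which only calls the other, seems to add a useless level of nesting, but it is a good practice that I am used to
  • I corrected the attribute list. Instead of a list of lists, each inner list holding a single dictionary, it returns a list of dictionaries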
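
The "remember the path" idea from the first note does not depend on recursion. As a minimal sketch (not part of the original answer), the same dotted paths can be built iteratively with an explicit stack of (node, path) pairs. The function name iter_paths and the inline sample, a shortened version of the XML from the question, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a shortened version of the XML from the question.
SAMPLE = """<Application Version="2.01">
    <UserAuthRequest>
        <VendorApp>
            <AppName>SING</AppName>
        </VendorApp>
    </UserAuthRequest>
    <ApplicationRequest ID="12-123-AH">
        <GUID>ABD45129-PD1212-121DFL</GUID>
    </ApplicationRequest>
</Application>"""


def iter_paths(root):
    """Yield (dotted_path, text, attributes) for every element, in document order."""
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        yield path, (node.text or "").strip(), dict(node.attrib)
        # Push children in reverse so that the first child is popped first.
        for child in reversed(list(node)):
            stack.append((child, path + "." + child.tag))


root = ET.fromstring(SAMPLE)
paths = [path for path, _, _ in iter_paths(root)]
print(paths)
```

Because each stack entry carries its own path, no state is shared between branches, which mirrors what the ancestor_string argument does in the recursive version.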

Output (I ran the script with Python 2.7 and Python 3.5):

['Application', 'Application.UserAuthRequest', 'Application.UserAuthRequest.VendorApp', 'Application.UserAuthRequest.VendorApp.AppName', 'Application.ApplicationRequest', 'Application.ApplicationRequest.GUID', 'Application.ApplicationRequest.Type', 'Application.ApplicationRequest.File', 'Application.ApplicationRequest.FileExtension', 'Application.ApplicationRequest.FileExtension.Result', 'Application.ApplicationRequest.FileExtension.Result.ResultCode'] 
['', '', '', 'SING', '', 'ABD45129-PD1212-121DFL', 'Streaming', '', '', '', 'Success'] 
[{'Version': '2.01'}, {}, {}, {}, {'ID': '12-123-AH'}, {}, {'tc': '200'}, {}, {'VendorCode': '200'}, {}, {'tc': '1'}] 
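
The three lists are parallel: index i in each one describes the same element. If one record per element is easier to work with, they can be zipped together. This is a sketch outside the original answer; the name to_records and the shortened sample lists are assumptions for illustration.

```python
def to_records(tags, texts, attrs):
    """Combine three parallel lists into one dict per element."""
    return [
        {"path": tag, "text": text, "attrs": attr}
        for tag, text, attr in zip(tags, texts, attrs)
    ]


# Shortened sample taken from the lists printed above.
tags = [
    "Application",
    "Application.ApplicationRequest",
    "Application.ApplicationRequest.Type",
]
texts = ["", "", "Streaming"]
attrs = [{"Version": "2.01"}, {"ID": "12-123-AH"}, {"tc": "200"}]

records = to_records(tags, texts, attrs)
for record in records:
    print(record)
```
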
0

I believe this is what you need:

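from bs4 import BeautifulSoup
from urllib.request import urlopen

# yourlinkhere is a placeholder for the URL of the XML document.
# The 'xml' parser (which requires lxml) keeps the case of tag names;
# the 'lxml' HTML parser would lowercase them and wrap them in html/body.
soup = BeautifulSoup(urlopen(yourlinkhere), 'xml')

lst = []

for tag in soup.find_all(True):
    # Collect ancestor names, skipping the BeautifulSoup document object itself
    names = [parent.name for parent in tag.parents if parent.name != '[document]']
    # tag.parents starts at the nearest ancestor, so reverse it to get root-first order
    lst.append('.'.join(names[::-1] + [tag.name]))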