2017-08-02 43 views

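XML parsing in Python with pandas: getting a complete tag block in one row

Hi, I can convert my XML file to a pandas DataFrame, but the records do not end up in the right rows. Say the XML contains a block of tags that repeats, e.g. 4 times, and that block has several child nodes which should become the columns of my DataFrame. When I read the XML I want exactly 4 rows in the DataFrame, but I get many extra rows filled with NaN, because the other tags sit at different levels of the hierarchy.

Edit: I found some discrepancies in my description of the XML data. The XML shown below is the corrected and final content.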

The same <ns1:parenttag> block is repeated multiple times in the XML file:

    <?xml version="1.0" encoding="UTF-8"?> 
    <row:user-agents xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns:row="http://www.row.com" 
    xmlns:ns1="http://www.ns1.com" 
    xmlns:ns2="http://www.ns2.com" 
    xmlns:ns3="http://www.ns3.com" 
    xmlns:row1="http://www.row1.com" 
    xmlns:row3="http://www.row3.com" 
    xmlns:row2="http://www.row2.com" 
    xsi:schemaLocation="http://www.schemaLocation-1.4.xsd"> 

<row:agent1> 
<row:test> 
    <row2:test1> 
    <row2:test2> 
     <row2:test3>9999</row2:test3> 
     <row2:test4>aa</row2:test4> 
     <row2:test5>1</row2:test5> 
    </row2:test2> 
    </row2:test1> 
    <row2:test6>2017</row2:test6> 
</row:test> 
<row:agent2> 
<row3:agent3> 

     <ns1:parenttag> 
      <ns1:childtag1> 
       <ns1:subchildtag1> 
        <ns1:indenticaltag>123</ns1:indenticaltag> 
       </ns1:subchildtag1> 
      </ns1:childtag1> 
      <ns1:indenticaltag>456</ns1:indenticaltag> 
      <ns1:childtag2>N</ns1:childtag2> 
      <ns1:childtag3>0</ns1:childtag3> 
      <ns1:childtag4>N</ns1:childtag4> 
      <ns1:childtag5> 
       <ns2:subchildtag2 attributname="abc"> 
        <ns2:sub_subchildtag1>12 45</ns2:sub_subchildtag1> 
       </ns2:subchildtag2> 
      </ns1:childtag5> 
      <ns1:childtag6>tyu</ns1:childtag6> 
      <ns1:childtag7>2</ns1:childtag7> 
      <ns1:childtag8> poiu</ns1:childtag8> 
      <ns1:childtag9> 
       <ns3:subchildtag3>345</ns3:subchildtag3> 
       <ns3:subchildtag6>567</ns3:subchildtag6> 

      </ns1:childtag9> 
      <ns1:childtag10>N</ns1:childtag10> 
      <ns1:childtag11> 
       <ns3:subchildtag4>34</ns3:subchildtag4> 
       <ns3:subchildtag5>abc/123</ns3:subchildtag5> 
      </ns1:childtag11> 
      <ns1:childtag12> 
       <ns1:indenticaltag>234</ns1:indenticaltag> 
      </ns1:childtag12> 
     </ns1:parenttag> 

</row3:agent3> 
</row:agent2> 
</row:agent1> 
</row:user-agents> 

Another XML file, in which the contents of the parent tag differ:

 <ns1:parenttag> 
      <ns1:indenticaltag>123</ns1:indenticaltag> 
      <ns1:childtag2>N</ns1:childtag2> 
      <ns1:childtag3>0</ns1:childtag3> 
      <ns1:childtag4>N</ns1:childtag4> 
      <ns1:childtag5> 
       <ns2:subchildtag1 attributename0="poi"> 
        <ns2:sub_subchildtag1> 
         <ns2:sub_sub_subchildtag1> 
          <ns2:sub_sub_sub_subchildtag1 attributename1="3" attributename2="17">1234</ns2:sub_sub_sub_subchildtag1> 
         </ns2:sub_sub_subchildtag1> 
        </ns2:sub_subchildtag1> 
       </ns2:subchildtag1> 
      </ns1:childtag5> 
      <ns1:childtag6>12</ns1:childtag6> 
      <ns1:childtag7> qwer</ns1:childtag7> 
      <ns1:childtag8> 
       <ns3:subchildtag2>456</ns3:subchildtag2> 
      </ns1:childtag8> 
      <ns1:childtag9>N</ns1:childtag9> 
      <ns1:childtag10> 
       <ns3:subchildtag3>908</ns3:subchildtag3> 
       <ns3:subchildtag4>abc/123</ns3:subchildtag4> 
      </ns1:childtag10> 
     </ns1:parenttag>   

I am currently using the function suggested in Parfait's answer below, but I get this error:

    ValueError: Length mismatch: Expected axis has 21 elements, new values have 22 elements

There is also an issue with the indenticaltag column: the same tag name appears three times at different levels of the hierarchy, but the DataFrame ends up with only one indenticaltag column instead of three, for example parent.child.indenticaltag, parent.child.subchild.indenticaltag and parent.child.subchild.sub_subchild.indenticaltag.

The output DataFrame I am after:

I want to parse both XMLs, which differ in structure, with a single function. All tags and their attributes should become column names in pandas. The column name for a tag should look like parent.child.subchild.sub_sub_subchildtag, and for an attribute it should look like parent.child.subchild.sub_sub_childtag.attribute.

Is there a better way to parse the XML and get the records in the proper format? Or am I missing something?

Edit: the solution works, but a few more complications have come up.

I need help with three points, if anyone can suggest some pointers:

    1) I need the pandas DataFrame column names in the form root.child.subchild.grandchild, and I am not sure how to get them here, although my own solution did produce them.
    2) The descendant approach is very slow. Is there any way to speed it up?
    3) I have multiple XML files of the same type in one directory and would like to build a single DataFrame by appending the results from all of them. What is the best way to do that?

Answers

1

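Consider running lxml's xpath() on the <xs:top_col> nodes, and using lxml's parse() to read directly from the file. The XPath loop iteratively appends to list and dictionary containers, which are then cast to a DataFrame. Also note that your desired output actually misaligns the node values: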

import pandas as pd
from lxml import etree
import re

pd.set_option('display.width', 1000)

NSMAP = {'row': 'http://www.row.com',
         'row3': 'http://www.row3.com',
         'row1': 'http://www.row1.com',
         'xs': 'http://www.xs.com',
         'row2': 'http://www.row2.com'}

xmldata = etree.parse('RowAgent.xml')

data = []
inner = {}
for el in xmldata.xpath('//xs:top_col', namespaces=NSMAP):
    for i in el:                       # PARSE CHILDREN
        inner[i.tag] = i.text
        if len(i) > 0:                 # PARSE GRANDCHILDREN (len() of an element counts its children)
            for subi in i:
                inner[subi.tag] = subi.text

    data.append(inner)                 # ONE DICT PER <xs:top_col> = ONE ROW
    inner = {}

df = pd.DataFrame(data)

# REGEX TO REMOVE NAMESPACE URIs IN COL NAMES
df.columns = [re.sub(r'\{.*?\}', '', col) for col in df.columns]

To parse child elements at any depth, use XPath's descendant::*

num_top_cols = len(xmldata.xpath('//xs:top_col', namespaces=NSMAP))

data = []
inner = {}
for i in range(1, num_top_cols+1):
    # (//xs:top_col)[i] PICKS THE i-th NODE IN DOCUMENT ORDER
    for el in xmldata.xpath('(//xs:top_col)[{}]/descendant::*'.format(i), namespaces=NSMAP):
        if el.text is not None and el.text.strip() != '':   # SKIP EMPTY TEXT TAGS
            inner[el.tag] = el.text.strip()

    data.append(inner)
    inner = {}

df = pd.DataFrame(data)

Output

print(df) 
# col11_1  col11_2 col8_1 col8_2  col1  col10 col12 col13_1 col2 col3 col4 col5 col6 col7 col9 
# 0  2010 AB 20/SEC001  2010 2016 00032000 test_name pqr 000330 N 0 3 N I AA N 
# 1 2016026 rty-qwe-01  2000 26000  03985  temp2 perrl 0117203 N 0 3 N a 9AA N 
# 2  8965 147A-254-044  7896 NaN  00985  mjkl rtyyu 45612 N 0 3 N NaN yuio N 
# 3 52369 ui 247/mh45 145ghg7 NaN  78965  ghyuio trwer  9874 N 0 5 N NaN 23rt N 
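
One caveat with keying the dictionary on el.tag, sketched below with the standard library's xml.etree.ElementTree rather than lxml so it runs without extra installs: elements that share a tag name at different depths land on the same key, so only the last value survives. That is a plausible (not confirmed) cause of the "Length mismatch" error reported in the question, since a list of dotted names can end up longer than the set of DataFrame columns. The XML is a trimmed copy of the question's <ns1:parenttag> block, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the question's <ns1:parenttag> block (illustration only)
XML = """<ns1:parenttag xmlns:ns1="http://www.ns1.com">
  <ns1:childtag1>
    <ns1:subchildtag1>
      <ns1:indenticaltag>123</ns1:indenticaltag>
    </ns1:subchildtag1>
  </ns1:childtag1>
  <ns1:indenticaltag>456</ns1:indenticaltag>
  <ns1:childtag12>
    <ns1:indenticaltag>234</ns1:indenticaltag>
  </ns1:childtag12>
</ns1:parenttag>"""

root = ET.fromstring(XML)

by_tag = {}       # dictionary keyed by tag, as in the loop above
leaf_count = 0    # how many text-bearing elements were actually visited
for el in root.iter():
    text = (el.text or '').strip()
    if text:                      # keep only elements that carry text
        leaf_count += 1
        by_tag[el.tag] = text     # same tag -> same key -> earlier value is overwritten

print(leaf_count)   # number of text-bearing elements found
print(len(by_tag))  # number of dictionary keys that survive
```

Three values go in but a single key comes out, so any scheme that needs one column per element has to key on something more specific than the tag, such as the full path from the parent.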

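Given the performance problems with descendant::*, consider a recursive call that first walks all descendants, followed by a second recursive call that captures the parent/child/grandchild names for the DataFrame columns. Be sure to use an OrderedDict now, so that the values keep the same order as the names: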

from collections import OrderedDict

#... same as above XML setup ... #

def recursiveParse(curr_elem, curr_inner):
    # DEPTH-FIRST WALK: STORE TEXT AND ATTRIBUTES OF EVERY DESCENDANT
    for child_elem in curr_elem:
        curr_inner[child_elem.tag] = child_elem.text
        for attrib in child_elem.attrib:
            curr_inner[attrib] = child_elem.attrib[attrib]
        recursiveParse(child_elem, curr_inner)

    return curr_inner

data = []
for el in xmldata.xpath('//xs:top_col', namespaces=NSMAP):
    inner = OrderedDict()
    for i in el:
        inner[i.tag] = i.text
        for attrib in i.attrib:
            inner[attrib] = i.attrib[attrib]
        recursiveParse(i, inner)

    data.append(inner)

df = pd.DataFrame(data)

colnames = []
def recursiveNames(curr_elem, curr_inner, num):
    # curr_inner[num-1] IS THE DOTTED NAME OF curr_elem
    parent_name = curr_inner[num-1]
    for child_elem in curr_elem:
        tmp = re.sub(r'\{.*?\}', '', child_elem.tag)
        child_name = parent_name + '.' + tmp
        curr_inner.append(child_name)
        child_pos = len(curr_inner)          # POSITION OF THE CHILD'S OWN NAME
        for attrib in child_elem.attrib:
            curr_inner.append(child_name + '.' + attrib)
        recursiveNames(child_elem, curr_inner, child_pos)

    return curr_inner

for el in xmldata.xpath('(//xs:top_col)[1]', namespaces=NSMAP):
    for i in el:
        tmp = re.sub(r'\{.*?\}', '', i.tag)
        colnames.append(tmp)
        recursiveNames(i, colnames, len(colnames))

# NOTE: ASSUMES EVERY <xs:top_col> HAS THE SAME STRUCTURE AS THE FIRST ONE
# AND THAT NO TWO ELEMENTS OR ATTRIBUTES IN IT SHARE A NAME
df.columns = colnames

Output

print(df) 
#  col1 col2 col3 col4 col5 col6 col7     col8 col8.col8_1 col8.col8_1.sName col8.col8_2 col9  col10     col11 col11.col11_1 col11.col11_2 col12     col13 col13.col13_1 
# 0 00032000 N 0 3 N I AA \n       2010    pqrst  2016 N test_name \n       2010 AB 20/SEC001 pqr \n       000330 
# 1  03985 N 0 3 N a 9AA \n       2000    NaN  26000 N  temp2 \n       2016026 rty-qwe-01 perrl \n       0117203 
# 2  00985 N 0 3 N NaN yuio \n       7896    NaN   NaN N  mjkl \n       8965 147A-254-044 rtyyu \n       45612 
# 3  78965 N 0 5 N NaN 23rt \n      145ghg7    NaN   NaN N  ghyuio \n       52369 ui 247/mh45 trwer \n       9874 
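
As a variation on the two-pass approach above, the dotted name can be used directly as the dictionary key, so values and column names are produced in a single pass and same-named tags at different levels stay separate. This is only a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it is self-contained, the helper names (local_name, flatten, parent_rows) are made up for illustration, and the sample XML is a trimmed version of the question's <ns1:parenttag> block wrapped in a placeholder root. Repeated siblings with the same path would still overwrite each other.

```python
import re
import xml.etree.ElementTree as ET
from collections import OrderedDict

def local_name(tag):
    """Drop the '{namespace-uri}' prefix that ElementTree puts on tags."""
    return re.sub(r'^\{.*?\}', '', tag)

def flatten(elem, path, row):
    """Store attributes and text of elem under its dotted path, then recurse."""
    for attr, value in elem.attrib.items():
        row[path + '.' + local_name(attr)] = value
    text = (elem.text or '').strip()
    if text:                                  # skip pure-whitespace container nodes
        row[path] = text
    for child in elem:
        flatten(child, path + '.' + local_name(child.tag), row)
    return row

def parent_rows(root, parent_tag):
    """One ordered dict (= one DataFrame row) per occurrence of parent_tag."""
    return [flatten(p, local_name(p.tag), OrderedDict())
            for p in root.iter(parent_tag)]

# Hypothetical sample, trimmed from the question's XML
SAMPLE = """<root xmlns:ns1="http://www.ns1.com" xmlns:ns2="http://www.ns2.com">
  <ns1:parenttag>
    <ns1:childtag1>
      <ns1:subchildtag1>
        <ns1:indenticaltag>123</ns1:indenticaltag>
      </ns1:subchildtag1>
    </ns1:childtag1>
    <ns1:indenticaltag>456</ns1:indenticaltag>
    <ns1:childtag5>
      <ns2:subchildtag2 attributname="abc">
        <ns2:sub_subchildtag1>12 45</ns2:sub_subchildtag1>
      </ns2:subchildtag2>
    </ns1:childtag5>
  </ns1:parenttag>
</root>"""

rows = parent_rows(ET.fromstring(SAMPLE), '{http://www.ns1.com}parenttag')
for key, value in rows[0].items():
    print(key, '=', value)
```

Passing the resulting list to pd.DataFrame(rows) should then give one row per <ns1:parenttag>, with NaN wherever a given block lacks a path that another block has.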

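Finally, wrap this processing and the original XML parsing in a loop that iterates over all XML files in a directory. Be sure to collect the DataFrames in a list and then append/stack them with pd.concat().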

import os
import pandas as pd
# ... other modules as above ...

xml_dir = '/path/to/XML/files'

dfList = []
for f in os.listdir(xml_dir):
    #...xml parse... (passing in os.path.join(xml_dir, f) for file name in parse())
    #...dataframe build with recursive calls...

    dfList.append(df)

finaldf = pd.concat(dfList)
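
A slightly different way to arrange the same loop is to collect plain row dictionaries from every file and build the DataFrame once at the end. The sketch below is an assumption-laden illustration, not the code used above: it relies only on the standard library (glob, xml.etree.ElementTree), reads just the direct children of each <ns1:parenttag>, and the function names and the source_file column are invented for the example. It writes two throwaway files to a temporary directory so that it can run on its own.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

PARENT = '{http://www.ns1.com}parenttag'   # Clark notation for ns1:parenttag

def rows_from_file(path):
    """Return one {child-tag: text} dict per parent element in a single file."""
    rows = []
    for parent in ET.parse(path).getroot().iter(PARENT):
        row = {'source_file': os.path.basename(path)}   # remember where the row came from
        for child in parent:
            row[child.tag.split('}', 1)[-1]] = (child.text or '').strip()
        rows.append(row)
    return rows

def rows_from_dir(directory):
    """Concatenate the rows of every *.xml file in directory, in name order."""
    all_rows = []
    for path in sorted(glob.glob(os.path.join(directory, '*.xml'))):
        all_rows.extend(rows_from_file(path))
    return all_rows

# Demo with two throwaway files (hypothetical content)
TEMPLATE = ('<r xmlns:ns1="http://www.ns1.com"><ns1:parenttag>'
            '<ns1:childtag2>{}</ns1:childtag2></ns1:parenttag></r>')
with tempfile.TemporaryDirectory() as tmp:
    for name, value in [('a.xml', 'N'), ('b.xml', 'Y')]:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as fh:
            fh.write(TEMPLATE.format(value))
    demo_rows = rows_from_dir(tmp)

print(demo_rows)
```

pd.DataFrame(demo_rows) would turn the combined list into a single DataFrame. Filtering on '*.xml' also avoids trying to parse unrelated files that happen to sit in the same directory.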
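
Far better than mine, thanks a lot! One question: what if we have children at deeper levels of the hierarchy? Is there a standard way to traverse all the sub-children? – user07

Good question. See the updated extension using XPath's 'descendant::*', which visits each '<xs:top_col>' by its node index and parses all of its descendants. – Parfait

How big is your XML? Over 1 GB? And how slow is it for you? Also, attributes and text are very different things. Your example XML does not contain attributes, nor did your code attempt to parse them. Always post a realistic example of your **actual** data. – Parfait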

0

Hi, I have found an answer to the problem above and am posting it in case it helps others:

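import pandas as pd
import xml.etree.ElementTree as et   # assumed: the original post does not show what `et` is

# Assumed prefix-to-URI map (not shown in the original post); URIs copied from the answer above
namespaces = {'row': 'http://www.row.com',
              'row3': 'http://www.row3.com',
              'xs': 'http://www.xs.com'}

xml_data = 'test.xml'   # et.parse() takes a file name or file object, not the file's bytes

def xml2df(xml_data):
    tree = et.parse(xml_data)
    all_records = []
    result = {}
    for el in tree.iterfind("./row:agent1/row:agent2/row3:agent3/xs:top_col/", namespaces):
        # strip the namespace URI from each child tag
        for r in el:
            if '}' in r.tag:
                r.tag = r.tag.split('}', 1)[1]
        for i in el.iterfind('*'):
            # rename each grandchild to child.grandchild
            for s in i:
                if '}' in s.tag:
                    s.tag = s.tag.split('}', 1)[1]
                s.tag = i.tag + "." + s.tag

            result[i.tag] = i.text

            for j in i.iterfind('*'):
                result[j.tag] = j.text

        all_records.append(result)   # one record per matched element
        result = {}

    df = pd.DataFrame(all_records)
    return df

df1 = xml2df(xml_data)
df1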