2010-09-21 110 views

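Create an xml tree from a text file in Python

I need to avoid creating duplicate branches in an xml tree when parsing a text file. Say the text file looks like this (the order of the lines is random):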

BRANCH1:branch11:message11
BRANCH1:branch12:message12
BRANCH2:branch21:message21
BRANCH2:branch22:message22

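So the resulting xml tree should have a root with two branches, and each of those branches should have two sub-branches. The Python code I use to parse this text file is as follows: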

import string 
fh = open ('xmlbasic.txt', 'r') 
allLines = fh.readlines() 
fh.close() 
import xml.etree.ElementTree as ET 
root = ET.Element('root') 

for line in allLines: 
    tempv = line.split(':') 
    branch1 = ET.SubElement(root, tempv[0]) 
    branch2 = ET.SubElement(branch1, tempv[1]) 
    branch2.text = tempv[2] 

tree = ET.ElementTree(root) 
tree.write('xmlbasictree.xml') 

The problem with this code is that a new branch is created in the xml tree for every line of the text file.

Any suggestions on how to avoid creating another branch in the xml tree if a branch with that name already exists?

Answers

1
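import xml.etree.ElementTree as ET

with open("xmlbasic.txt") as lines_file:
    # Read the file as a list of lines (not one string), skipping blank ones
    lines = [line.strip() for line in lines_file if line.strip()]

root = ET.Element('root')

for line in lines:
    head, subhead, tail = line.split(":")

    # find() returns None when there is no matching child. Compare with None
    # explicitly, because an element without children tests as false.
    head_branch = root.find(head)
    if head_branch is None:
        head_branch = ET.SubElement(root, head)

    subhead_branch = head_branch.find(subhead)
    if subhead_branch is None:
        subhead_branch = ET.SubElement(head_branch, subhead)

    subhead_branch.text = tail

tree = ET.ElementTree(root)
ET.dump(tree)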

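The logic is very simple, and you already stated it in your question! You just need to check whether a branch already exists in the tree before creating it.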
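The check-before-create step can also be pulled out into a small helper. This is only a sketch: the names `get_or_create_child` and `build_tree` are made up for illustration, and the sample lines stand in for the contents of `xmlbasic.txt`.

```python
import xml.etree.ElementTree as ET


def get_or_create_child(parent, tag):
    """Return the first direct child of `parent` named `tag`, creating it if absent."""
    child = parent.find(tag)
    # Compare with None explicitly: an element with no children of its own
    # tests as false, so `if not child` would create duplicates.
    if child is None:
        child = ET.SubElement(parent, tag)
    return child


def build_tree(lines):
    """Build a <root> element from 'head:subhead:message' lines."""
    root = ET.Element("root")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # maxsplit=2 keeps any further colons inside the message
        head, subhead, message = line.split(":", 2)
        branch = get_or_create_child(root, head)
        get_or_create_child(branch, subhead).text = message
    return root


# Hypothetical sample input, in deliberately mixed order
sample = [
    "BRANCH1:branch11:message11",
    "BRANCH2:branch21:message21",
    "BRANCH1:branch12:message12",
    "BRANCH2:branch22:message22",
]
root = build_tree(sample)
print(ET.tostring(root, encoding="unicode"))
```

Because the helper only looks at the direct children of the element it is given, the same function works for both levels of the tree.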

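Note that this may be inefficient, because every line triggers a linear search through the child elements that already exist. ElementTree is not designed to enforce uniqueness, so there is no faster built-in lookup by name.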


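If you need the speed (you may not, especially for smallish trees!), a more efficient approach is to store the tree structure in a defaultdict before converting it to an ElementTree.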

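import collections
import xml.etree.ElementTree as ET

with open("xmlbasic.txt") as lines_file:
    # Read the file as a list of lines (not one string), skipping blank ones
    lines = [line.strip() for line in lines_file if line.strip()]

# Maps each head to a dict of {subhead: message}; a repeated pair overwrites the earlier message
root_dict = collections.defaultdict(dict)
for line in lines:
    head, subhead, tail = line.split(":")
    root_dict[head][subhead] = tail

# Convert the nested dictionaries into elements
root = ET.Element('root')
for head, branch in root_dict.items():
    head_element = ET.SubElement(root, head)
    for subhead, tail in branch.items():
        ET.SubElement(head_element, subhead).text = tail

tree = ET.ElementTree(root)
ET.dump(tree)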

Thanks, this and the other answers are all good, but I'll stick with the defaultdict, because the actual text and xml files are quite large. – bitman 2010-09-21 11:54:26

0

Something along these lines? You keep the branches of each level in a dictionary so they can be reused.

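# Assumes ET, root and allLines are defined as in the question
b1map = {}  # first-level tag -> element already created for it

for line in allLines:
    tempv = line.strip().split(':')
    branch1 = b1map.get(tempv[0])
    if branch1 is None:
        # First time this name is seen: create the branch and remember it
        branch1 = b1map[tempv[0]] = ET.SubElement(root, tempv[0])
    branch2 = ET.SubElement(branch1, tempv[1])
    branch2.text = tempv[2]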
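
The snippet above only reuses first-level branches. If second-level names can repeat as well, the same dictionary idea might be extended by keying on the path to each element. A rough sketch, where `build_tree_cached` and the sample lines are hypothetical and not part of the original code:

```python
import xml.etree.ElementTree as ET


def build_tree_cached(lines):
    """Build the tree with one dictionary lookup per level instead of a search."""
    root = ET.Element("root")
    cache = {}  # path tuple such as ("BRANCH1", "branch11") -> its element
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        head, subhead, message = line.split(":", 2)

        branch = cache.get((head,))
        if branch is None:
            branch = cache[(head,)] = ET.SubElement(root, head)

        leaf = cache.get((head, subhead))
        if leaf is None:
            leaf = cache[(head, subhead)] = ET.SubElement(branch, subhead)

        # A repeated head:subhead pair overwrites the earlier message
        leaf.text = message
    return root


# Hypothetical sample input standing in for the text file
sample = [
    "BRANCH2:branch22:message22",
    "BRANCH1:branch11:message11",
    "BRANCH2:branch21:message21",
    "BRANCH1:branch12:message12",
]
root = build_tree_cached(sample)
ET.dump(root)
```

Since the function only iterates over `lines`, an open file object could be passed in directly, which avoids loading a large file into a list first.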