2015-06-14 64 views
2

How do I split an XML document into strings between specific tags?

Say I have the following XML:

<foo> 
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam> 
<bar taste="eww"> stuff </bar> <bar> stuff </bar> 
<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon> 
</foo> 

The spam, bar, and bacon tags contain more tags with data inside them. I want to split the XML into these:

  • <spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
  • <bar taste="eww"> stuff </bar> <bar> stuff </bar>
  • <bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon>

I want to do this in order to reorder the XML for parsing.

The basic structure looks like this, with the blocks in any order:

<foo> 
block of bar tags 
block of spam tags 
block of bacon tags 
</foo> 

Answers

0

Have you looked at the ElementTree methods?

import xml.etree.ElementTree as ET 

document = ET.parse("file.xml") 
spams = document.findall("spam") 
bars = document.findall("bar") 
bacon = document.findall("bacon")
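
To get each block back as a single string and put the blocks in a chosen order, the elements returned by findall can be serialized and joined. The following is only a minimal sketch: it assumes Python 3, parses the XML from the question out of a string instead of file.xml, and uses made-up names (`block_as_string`, `order`) purely for illustration.

```python
import xml.etree.ElementTree as ET

raw_xml = """<foo>
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam>
<bar taste="eww"> stuff </bar> <bar> stuff </bar>
<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon>
</foo>"""

root = ET.fromstring(raw_xml)


def block_as_string(root, tag):
    """Serialize every direct child of root named `tag` into one string."""
    pieces = []
    for element in root.findall(tag):
        # encoding="unicode" makes tostring return str instead of bytes.
        text = ET.tostring(element, encoding="unicode")
        # tostring also emits the whitespace that follows the element
        # (its tail), so trim it before joining.
        pieces.append(text.strip())
    return " ".join(pieces)


# Hypothetical target order for the blocks.
order = ["bar", "spam", "bacon"]
blocks = {tag: block_as_string(root, tag) for tag in order}

reordered = "<foo>\n" + "\n".join(blocks[tag] for tag in order) + "\n</foo>"
print(reordered)
```

Note that findall("spam") only matches direct children of the root, so tags nested deeper inside a spam element stay inside that element's string.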
1

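If you don't know what the tag names are at runtime and just want to split the elements into groups, you could try combining itertools.groupby with whatever XML parsing library you like: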

import xml.etree.ElementTree as et 
import itertools 

raw_xml = '''<foo> 
<spam taste="great"> stuff</spam> <spam taste="moldy"> stuff</spam> 
<bar taste="eww"> stuff </bar> <bar> stuff </bar> 
<bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon><bacon taste="yum"> stuff </bacon> 
<spam taste="Great">stuff2</spam> 
</foo>''' 

groups = itertools.groupby(et.fromstring(raw_xml), lambda element: element.tag) 
groups = [list(group[1]) for group in groups] 

print(groups)

The output would then be:

[[<Element 'spam' at 0x218ecb0>, <Element 'spam' at 0x218ee10>], 
[<Element 'bar' at 0x218ee90>, <Element 'bar' at 0x218eeb0>], 
[<Element 'bacon' at 0x218ef30>, <Element 'bacon' at 0x218ef50>, <Element 'bacon' at 0x218ef90>], 
[<Element 'spam' at 0x218efd0>]] 

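If you need the actual string values, you can do:

# On Python 3, tostring returns bytes unless encoding="unicode" is passed
print([[et.tostring(element, encoding="unicode") for element in group] for group in groups])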

...which will get you:

[['<spam taste="great"> stuff</spam> ', '<spam taste="moldy"> stuff</spam>\n'], 
['<bar taste="eww"> stuff </bar> ', '<bar> stuff </bar> \n'], 
['<bacon taste="yum"> stuff </bacon>', '<bacon taste="yum"> stuff </bacon>', '<bacon taste="yum"> stuff </bacon>\n'], 
['<spam taste="Great">stuff2</spam>\n']] 
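
groupby only merges elements that sit next to each other, which is why the trailing spam element ends up in a group of its own. If every element with the same tag should land in one group regardless of position, one option is to collect them in a dict keyed by tag name. This is a rough sketch assuming Python 3; the shortened sample XML and the nested `<egg>` tag are invented here just to show that nested content is carried along unchanged.

```python
import xml.etree.ElementTree as et
from collections import defaultdict

raw_xml = """<foo>
<spam taste="great"><egg>1</egg></spam>
<bar> stuff </bar>
<spam taste="moldy"> stuff</spam>
</foo>"""

# Map each tag name to the serialized form of every element with that tag.
by_tag = defaultdict(list)
for element in et.fromstring(raw_xml):
    # Iterating over the root yields only its direct children; anything
    # nested inside a child is serialized as part of that child.
    serialized = et.tostring(element, encoding="unicode")
    # Drop the trailing whitespace that tostring includes as the tail.
    by_tag[element.tag].append(serialized.strip())

print(dict(by_tag))
```

The dict keeps the tags in the order they first appear, so the blocks can still be rearranged afterwards by looking them up in whatever order is needed.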
+0

I know the names at runtime, and I want to group all the tags of each type together. –

+0

Does this also work if there are many nested tags inside each spam, bacon, and bar tag? –

+0

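@JamesLu - If you know the names at runtime and don't care about keeping each group separate (i.e. you just want _all_ spam elements, _all_ bar elements, and so on), then Daniel's solution is probably better for you. If you want to preserve each block (so that spam bar spam results in 3 groups), then my solution would be better. On nesting: my solution ignores whatever is nested inside the tags and keeps it as it is. Not sure if that's what you're looking for. – Michael0x2a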
