Python regex for nested XML elements

I have a text (not a proper XML document) in which some words are wrapped in XML tags, like this:

We have Potter the <term attrib="LINE:246">wizard</term> interacting with<term attrib="LINE:36080">witches</term> and <term attrib="LINE:360">goblins</term> talking about <term attrib="LINE:337"><term attrib="LINE:329"><term attrib="LINE:468">dark</term></term> <term attrib="LINE:375">arts</term></term> in regions to the east of Hogwarts.

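I need to extract the terms inside the XML tags. My problem is that I don't know which regular expression to use to get nested elements like this one: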

<term><term>something</term><term>else</term></term>

I am using Python for this, and I have tried the following:

re.findall(r'(<term.+?</term>)', textfile)

But what I get is this:

<term><term>something</term>

This is no good, because I lose the rest of the element. I also tried the following greedy version (which is even worse):

re.findall(r'(<term.+</term>)', textfile)

Can you help me?

Source

2016-05-30 E_Munch

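You may find http://stackoverflow.com/questions/37113364/regex-for-nested-xml-attributes informative about the problems involved in trying to parse nested XML with regular expressions...

ObZalgo: http://stackoverflow.com/a/1732454/4014959 :)

Only the PyPI regex module provides recursive regular expressions.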

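You are using the wrong tool. Regular expressions cannot (in general) count, so using them for something like this will be very fragile. Use a proper XML parser with a nice front end, such as BeautifulSoup. It will save you time and give you better results than a regex.

See the great docs for examples.

Source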
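To illustrate the parser approach, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree for BeautifulSoup so that it runs without extra installs, and it wraps the text in a made-up <root> element to turn it into well-formed XML. That wrapper, and the assumption that the text contains no stray < or & characters, are mine and not part of the question.

```python
import xml.etree.ElementTree as ET

text = (
    'We have Potter the <term attrib="LINE:246">wizard</term> interacting with'
    '<term attrib="LINE:36080">witches</term> and '
    '<term attrib="LINE:360">goblins</term> talking about '
    '<term attrib="LINE:337"><term attrib="LINE:329">'
    '<term attrib="LINE:468">dark</term></term> '
    '<term attrib="LINE:375">arts</term></term> '
    'in regions to the east of Hogwarts.'
)

# Hypothetical <root> wrapper: the fragment alone has no single top-level element.
root = ET.fromstring("<root>" + text + "</root>")

# Direct children of <root> are the outermost <term> elements;
# itertext() joins the text of everything nested inside each one.
outer_terms = ["".join(el.itertext()) for el in root.findall("term")]

# iter() walks every <term>, nested ones included, in document order.
all_terms = ["".join(el.itertext()) for el in root.iter("term")]

print(outer_terms)
print(all_terms)
```

The parser keeps track of nesting itself, so the nested "dark arts" element comes back as one unit, and its inner terms are still reachable separately.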

2016-05-30 15:25:23 oligofren

-1

Maybe try:

import re

text = 'We have Potter the <term attrib="LINE:246">wizard</term> interacting with<term attrib="LINE:36080">witches</term> and <term attrib="LINE:360">goblins</term> talking about <term attrib="LINE:337"><term attrib="LINE:329"><term attrib="LINE:468">dark</term></term> <term attrib="LINE:375">arts</term></term> in regions to the east of Hogwarts.'
# Remove every tag (anything between < and >), keeping the text in between
text = re.sub(r"<.+?>", "", text)
# Collapse any run of spaces into a single space
text = re.sub(r" {2,}", " ", text)
print(text)

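This should cut out every <tag> and </tag> there is, leaving everything else intact.

Of course, if there is any < sign that is not part of an XML tag, it will get messy.

Source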
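If the goal is to keep the terms rather than strip the tags from the whole text, one possible variation is to use the regex only to find the tags and do the depth counting in plain Python. This is a rough sketch under the same caveat about stray < signs. The function name outermost_terms and the short sample string are made up for illustration.

```python
import re

def outermost_terms(text):
    """Return the text of each outermost <term>...</term>, inner tags removed."""
    results = []
    depth = 0      # how many <term> elements are currently open
    start = None   # index just after the outermost opening tag
    for m in re.finditer(r"<(/?)term\b[^>]*>", text):
        if m.group(1) == "":       # opening tag
            if depth == 0:
                start = m.end()
            depth += 1
        elif depth > 0:            # closing tag that matches an open element
            depth -= 1
            if depth == 0:
                inner = text[start:m.start()]
                # Drop the nested tags, keep the words between them
                results.append(re.sub(r"<[^>]+>", "", inner))
    return results

sample = "<term>a</term> x <term><term>dark</term> <term>arts</term></term>"
print(outermost_terms(sample))
```

The counter does the job that a single regular expression cannot do: it only closes a term when as many closing tags as opening tags have been seen.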

2016-05-30 15:30:12 Maciek
