Extracting text within an XML tag using Python ElementTree

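I have a corpus of tens of thousands of (small) XML files, and I'm trying to use Python to extract the text contained within one of the XML tags. For example, everything between the body tags in something like this: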

<body> sample text here with <bold> nested </bold> tags in this paragraph </body>

and then write a text document containing that string, and move on down the list of XML files.

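I'm using effbot's ElementTree, but I can't find the right commands/syntax to do this. I found a website that uses miniDOM's dom.getElementsByTagName, but I'm not sure what the corresponding method is in ElementTree. Any ideas would be greatly appreciated.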

Source

2012-06-16 Levar

I'd read through some tutorials and then get started; [the XML chapter of Dive Into Python 3](http://getpython3.com/diveintopython3/xml.html) would be a good place to start. –

In your example, do you also want the nested tags, or only the text inside them? –

Is there any other content outside the "body" tag? – poke

I would just use re:

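import re
# body_txt holds one file's XML; re.search finds <body> anywhere, re.DOTALL lets . span newlines
body_txt = re.search(r'<body>(.*?)</body>', body_txt, re.DOTALL).group(1)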

Then strip out the inner tags:

body_txt = re.sub('<.*?>','',body_txt)

It's true that you shouldn't use regular expressions when they aren't needed... but there's nothing wrong with using them when they are.
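Applied to the sample from the question, the two regex steps can be sketched as below. This is only an illustration under a few assumptions: it uses `re.search` with `re.DOTALL` so that text before `<body>` and line breaks inside it do not prevent a match, and the final whitespace-collapsing step is an addition that is not part of the answer above. A `<body>` tag that carries attributes, or text containing entities such as `&amp;`, would need extra handling.

```python
import re

sample = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"

# Step 1: capture everything between <body> and </body>.
# re.DOTALL lets '.' match newlines; the non-greedy '.*?' stops at the first </body>.
match = re.search(r"<body>(.*?)</body>", sample, re.DOTALL)
body_txt = match.group(1) if match else ""

# Step 2: drop any remaining tags such as <bold> and </bold>.
body_txt = re.sub(r"<.*?>", "", body_txt)

# Step 3 (an addition, not in the answer): collapse runs of whitespace
# left behind where the tags used to be.
body_txt = " ".join(body_txt.split())

print(body_txt)
```

For the sample this leaves only the plain sentence, with the `bold` markup gone.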

Source

2012-06-18 19:44:52 Scruffy

A better answer, showing how to actually do this with XML parsing:

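import xml.etree.ElementTree as ET

stringofxml = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"

def extractTextFromElement(elementName, stringofxml):
    # fromstring() returns the root element (<body> for this sample)
    tree = ET.fromstring(stringofxml)
    # Iterating over an element yields its direct children only
    for child in tree:
        if child.tag == elementName:
            # .text is None for an empty element, so guard before strip()
            return (child.text or '').strip()

print(extractTextFromElement('bold', stringofxml))  # prints: nested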
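The function above returns only the direct text of a child element. To get everything inside `body`, nested tags included, and to write it out for a whole folder of files as the question describes, something like the following might work. It is a minimal sketch, assuming Python 3, UTF-8 output, and a flat folder of `*.xml` files; the names `element_text` and `convert_folder` and the folder layout are hypothetical. `Element.iter()` is roughly ElementTree's counterpart to miniDOM's `getElementsByTagName`, and `Element.itertext()` yields the text of an element together with the text of its descendants.

```python
import glob
import os
import xml.etree.ElementTree as ET


def element_text(root, tag="body"):
    """Return all text inside the first <tag> element, nested tags included."""
    # iter() yields the root itself as well as its descendants,
    # so a document whose root element is <body> is handled too.
    for elem in root.iter(tag):
        # itertext() walks the element's own text plus the text and tail
        # of every descendant, in document order.
        return "".join(elem.itertext()).strip()
    return None  # no matching element found


def convert_folder(src_dir, dst_dir):
    """Write one .txt file per .xml file in src_dir (hypothetical layout)."""
    for path in sorted(glob.glob(os.path.join(src_dir, "*.xml"))):
        # ET.parse() reads the file and honours its declared encoding
        text = element_text(ET.parse(path).getroot())
        if text is None:
            continue  # skip files without a <body> element
        name = os.path.splitext(os.path.basename(path))[0] + ".txt"
        with open(os.path.join(dst_dir, name), "w", encoding="utf-8") as out:
            out.write(text)


sample = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"
print(element_text(ET.fromstring(sample)))
```

Calling `convert_folder("xml_in", "txt_out")` (both folder names are placeholders) would then walk the corpus one file at a time. Whitespace around the removed tags is kept as it appears in the source, so the result may contain doubled spaces.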

Source

2013-08-20 19:09:53 Hawkwing
