Using lxml, how can I read the text inside nested elements?

I am trying to search about 500 XML documents for certain phrases and output the ID of every element that contains any of those phrases. This is my code so far:

from lxml import etree
import os
import re

files = os.listdir('C:/Users/Me/Desktop/xml')
search_words = ['House divided', 'Committee divided', 'on Division', 'Division List',
                'The Ayes and the Noes']

for f in files:
    doc = etree.parse('C:/Users/Me/Desktop/xml/' + f)
    for elem in doc.iter():
        for word in search_words:
            if elem.text is not None and str(elem.attrib) != "{}" and word in elem.text and len(re.findall(r'\d+', elem.text)) > 1:
                votes = re.findall(r'\d+', elem.text)
                string = str(elem.attrib)[8:-2] + ","
                string += (str(votes[0]) + "," + str(votes[1]) + ",")
                string += word + ","
                string += str(elem.sourceline)
                print(string)

Input like this produces the correct output:

<p id="S3V0001P0-01869">The House divided; Against the Motion 83; For it 23&#x2014;Majority 60.</p>

But input with nested elements like this is missed, because the text inside the child elements is never searched for the phrases:

<p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER</member><membercontribution> said, that the precedent occurred on the 8th of April, 1850, on a Motion for going into a Committee of Supply. An Amendment was moved by Captain Boldero on the subject of assistant-surgeons in the navy, when, on a division being called for, the Question was put that the words proposed to be left out stand part of the Question. The House divided, when the numbers were&#x2014;Ayes, 40; Noes, 48. The Question, "That the proposed words be added" was put and agreed to; the main Question, as amended, was put and agreed to; and the Question being then put, "That Mr. Speaker do now leave the chair," that Motion was agreed to, and the House went into Committee of Supply.</membercontribution></p>

Is there a way to read the text inside nested elements like this and return the ID of the enclosing element?

2017-07-25 Kattletail

Since lxml provides an xpath method and XPath has a contains function, you can use, for example:

from lxml import etree as ET

doc = ET.fromstring('<p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER</member><membercontribution> said, that the precedent occurred on the 8th of April, 1850, on a Motion for going into a Committee of Supply. An Amendment was moved by Captain Boldero on the subject of assistant-surgeons in the navy, when, on a division being called for, the Question was put that the words proposed to be left out stand part of the Question. The House divided, when the numbers were&#x2014;Ayes, 40; Noes, 48. The Question, "That the proposed words be added" was put and agreed to; the main Question, as amended, was put and agreed to; and the Question being then put, "That Mr. Speaker do now leave the chair," that Motion was agreed to, and the House went into Committee of Supply.</membercontribution></p>')
result = doc.xpath('//*[@id and contains(., $word)]', word='House divided')
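
To tie this back to the loop in the question, here is a minimal sketch (not part of the original answer) that runs the same XPath once per search phrase and then reads the full nested text with `itertext()`. The helper name `find_divisions`, the `<debate>` wrapper, and the shortened sample text are assumptions made for illustration. Note that taking the first two numbers is only a rough heuristic: in the longer paragraph from the question, the first two numbers are part of a date, not the vote counts.

```python
import re
from lxml import etree

SEARCH_WORDS = ['House divided', 'Committee divided', 'on Division',
                'Division List', 'The Ayes and the Noes']

def find_divisions(root, words=SEARCH_WORDS):
    """Yield (id, first_number, second_number, phrase) for matching elements.

    Hypothetical helper: contains(., $word) tests the element's full string
    value, so text inside child elements is included in the match.
    """
    for word in words:
        for elem in root.xpath('//*[@id and contains(., $word)]', word=word):
            # itertext() walks the element and its descendants in document order
            text = ''.join(elem.itertext())
            numbers = re.findall(r'\d+', text)
            if len(numbers) > 1:
                yield elem.get('id'), numbers[0], numbers[1], word

# Shortened, made-up sample with the same nesting as the question's input
sample = etree.fromstring(
    '<debate>'
    '<p id="a1"><member>X</member><membercontribution> The House divided: '
    'Ayes, 40; Noes, 48.</membercontribution></p>'
    '<p id="a2">No vote here.</p>'
    '</debate>')

for row in find_divisions(sample):
    print(','.join(row))
```

For the real files, the same function could be called on `etree.parse(path).getroot()` for each document in the folder.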

2017-07-25 20:30:19

You can use some XPath to extract all the text elements you are interested in. I like Parsel: pip install parsel.

import parsel

data = ('<x><y><z><p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER'
        '</member><membercontribution> said, that the precedent occurred on the '
        '8th of April, 1850, on a Motion ...</membercontribution></p></z></y></x>')

selector = parsel.Selector(data)

for para in selector.xpath('//p'):
    elem_id = para.xpath('@id').extract_first()
    # one string per text node of each direct child element
    texts = para.xpath('*/text()').extract()
    for text in texts:
        # do whatever search
        print(elem_id, len(text), 'April' in text)

Output:

S3V0141P0-01248 31 False 
S3V0141P0-01248 77 True
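
The two output lines above come from two separate text nodes. As a small sketch of why the question's `elem.text` check misses this paragraph, and of how to get one combined string instead, the following uses plain lxml rather than Parsel (an assumption on my part, chosen to match the question). It relies only on `Element.text`, `xpath('*/text()')`, and the XPath `string(.)` function.

```python
from lxml import etree

p = etree.fromstring(
    '<p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER</member>'
    '<membercontribution> said, that the precedent occurred on the '
    '8th of April, 1850, on a Motion ...</membercontribution></p>')

# .text only holds text that appears before the first child element,
# so it is None here and the phrase test in the question never runs
print(p.text)

# '*/text()' returns one string per text node of the direct children
parts = p.xpath('*/text()')
print(len(parts))

# string(.) concatenates all descendant text into a single string
full = p.xpath('string(.)')
print(p.get('id'), 'April' in full)
```

Searching `full` instead of the individual parts also catches a phrase that is split across two child elements.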

2017-07-25 20:21:12
