How to get specific nodes in an XML file with Python

I'm using Python's built-in DOM module on a very large XML document,
and I'm looking for a way to get a specific tag. For example:

<AssetType longname="characters" shortname="chr" shortnames="chrs"> 
    <type> 
    pub 
    </type> 
    <type> 
    geo 
    </type> 
    <type> 
    rig 
    </type> 
</AssetType> 

<AssetType longname="camera" shortname="cam" shortnames="cams"> 
    <type> 
    cam1 
    </type> 
    <type> 
    cam2 
    </type> 
    <type> 
    cam4 
    </type> 
</AssetType>

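I want to retrieve the children of the AssetType node whose longname attribute is "characters", so that the result contains the values 'pub', 'geo', 'rig'.
Please keep in mind that I have more than 1000 <AssetType> nodes.
Thanks

Source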

2010-02-09 Moayyad Yaghi

If you don't mind loading the whole document into memory:

from lxml import etree 
data = etree.parse(fname) 
result = [node.text.strip() 
    for node in data.xpath("//AssetType[@longname='characters']/type")]

You may need to remove the stray spaces at the start of the tags in your file to make this work.

Source

2010-02-09 16:42:35 eswald

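This is my approach too. Keep in mind that it requires installing the lxml module, which is not part of the default Python library. However, I'm currently using it in a project where some of the XML files are 65 MB in size, and it doesn't complain (unlike the author of the script). – Tom 2010-02-09 16:44:50

+1 for 'lxml.etree', which is far superior to the default install of 'ElementTree'. – jathanism 2010-02-09 17:14:06

Use the xml.sax module. Build your own handler and, inside startElement, check whether the name is AssetType. That way you only take action while an AssetType node is being processed.

Here you have an example handler that shows how to build one (although it is not the prettiest way; at this point I don't know all the cool tricks of Python ;-)).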
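The linked handler is not reproduced on this page, so the following is only a minimal sketch of the idea, not the original example. The class name AssetTypeHandler, its attributes, and the <assets> root element wrapped around the sample data are assumptions made for illustration.

```python
import xml.sax


class AssetTypeHandler(xml.sax.ContentHandler):
    """Collect the text of <type> children of matching <AssetType> nodes."""

    def __init__(self, longname):
        super().__init__()
        self.longname = longname
        self.in_asset = False  # inside a matching <AssetType>?
        self.in_type = False   # inside a <type> of that element?
        self.buffer = []       # text chunks of the current <type>
        self.types = []        # collected results

    def startElement(self, name, attrs):
        if name == "AssetType" and attrs.get("longname") == self.longname:
            self.in_asset = True
        elif name == "type" and self.in_asset:
            self.in_type = True
            self.buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate.
        if self.in_type:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "type" and self.in_type:
            self.types.append("".join(self.buffer).strip())
            self.in_type = False
        elif name == "AssetType":
            self.in_asset = False


# Hypothetical sample: the question's data wrapped in an assumed <assets> root.
sample = b"""<assets>
<AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> pub </type>
    <type> geo </type>
    <type> rig </type>
</AssetType>
<AssetType longname="camera" shortname="cam" shortnames="cams">
    <type> cam1 </type>
</AssetType>
</assets>"""

handler = AssetTypeHandler("characters")
xml.sax.parseString(sample, handler)
print(handler.types)
```

For a file on disk, xml.sax.parse(filename, handler) streams the document in the same way, so the 1000+ AssetType nodes are never all held in memory at once.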

Source

2010-02-09 16:32:42 gruszczy

You can use XPath, with something like "//AssetType[@longname='characters']/xyz".

For XPath libraries in Python, see http://www.somebits.com/weblog/tech/python/xpath.html

Source
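Note that an XPath attribute test needs a leading @; without it, the predicate looks for a child element of that name. A minimal sketch of the difference, using the limited XPath subset in the standard library's xml.etree.ElementTree rather than one of the libraries from the link, and assuming the sample data sits under a hypothetical <assets> root:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: part of the question's data under an assumed root.
sample = """<assets>
<AssetType longname="characters" shortname="chr">
    <type> pub </type>
    <type> geo </type>
</AssetType>
<AssetType longname="camera" shortname="cam">
    <type> cam1 </type>
</AssetType>
</assets>"""

root = ET.fromstring(sample)

# [@longname='characters'] tests the longname attribute.
by_attribute = root.findall(".//AssetType[@longname='characters']/type")

# [longname='characters'] tests a <longname> child element,
# which this sample does not contain, so nothing matches.
by_child = root.findall(".//AssetType[longname='characters']/type")

types = [node.text.strip() for node in by_attribute]
print(types)
print(len(by_child))
```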

2010-02-09 16:34:05 ron

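You can use the pulldom API to parse a large file without loading it all into memory at once. It provides a more convenient interface than SAX, at the cost of slightly lower performance.

It basically lets you stream the XML file until you find the bit you're interested in, then start using regular DOM operations.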


from xml.dom import pulldom 

# http://mail.python.org/pipermail/xml-sig/2005-March/011022.html 
def getInnerText(oNode):
    rc = ""
    nodelist = oNode.childNodes
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
        elif node.nodeType == node.ELEMENT_NODE:
            rc = rc + getInnerText(node)  # recursive !!!
        elif node.nodeType == node.CDATA_SECTION_NODE:
            rc = rc + node.data
        else:
            # node.nodeType: PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, NOTATION_NODE and so on
            pass
    return rc


# xml_file is either a filename or a file
stream = pulldom.parse(xml_file)
for event, node in stream:
    if event == pulldom.START_ELEMENT and node.nodeName == "AssetType":
        if node.getAttribute("longname") == "characters":
            stream.expandNode(node)  # node now contains a mini-dom tree
            type_nodes = node.getElementsByTagName('type')
            for type_node in type_nodes:
                # type_text holds the text inside the <type> element
                type_text = getInnerText(type_node)

Source

2010-02-09 16:50:11

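Similar to eswald's solution: whitespace is stripped again and the document is again loaded into memory, but the three text items are returned at once.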

from lxml import etree 

data = """<AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> 
    pub 
    </type> 
    <type> 
    geo 
    </type> 
    <type> 
    rig 
    </type> 
</AssetType> 
""" 

doc = etree.XML(data) 

for asset in doc.xpath('//AssetType[@longname="characters"]'): 
    threetypes = [ x.strip() for x in asset.xpath('./type/text()') ] 
    print(threetypes)

Source

2010-02-09 16:56:06 MattH

Assuming your document is called assets.xml and has the following structure:

<assets> 
    <AssetType> 
     ... 
    </AssetType> 
    <AssetType> 
     ... 
    </AssetType> 
</assets>

Then you can do the following:

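from xml.etree.ElementTree import ElementTree
tree = ElementTree()
root = tree.parse("assets.xml")
# The leading "." is required: findall() on an element rejects absolute paths.
for asset_type in root.findall(".//AssetType[@longname='characters']"):
    # Iterating over an element yields its children (getchildren() was removed in Python 3.9).
    for type_node in asset_type:
        print(type_node.text)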

Source

2010-02-09 16:58:52

+1 for a solution that uses the default library. Portability rocks! – jathanism 2010-02-09 17:14:44
