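Really simple way to deal with XML in Python?

Musing over a recently asked question, I started to wonder whether there is a really simple way to deal with XML documents in Python. A pythonic way, if you will.

Perhaps I can explain best with an example. Let's say the following (which I think is a good example of how XML is (mis)used in web services) is the response I get from an HTTP request to http://www.google.com/ig/api?weather=94043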

<xml_api_reply version="1"> 
    <weather module_id="0" tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0" > 
    <forecast_information> 
     <city data="Mountain View, CA"/> 
     <postal_code data="94043"/> 
     <latitude_e6 data=""/> 
     <longitude_e6 data=""/> 
     <forecast_date data="2010-06-23"/> 
     <current_date_time data="2010-06-24 00:02:54 +0000"/> 
     <unit_system data="US"/> 
    </forecast_information> 
    <current_conditions> 
     <condition data="Sunny"/> 
     <temp_f data="68"/> 
     <temp_c data="20"/> 
     <humidity data="Humidity: 61%"/> 
     <icon data="/ig/images/weather/sunny.gif"/> 
     <wind_condition data="Wind: NW at 19 mph"/> 
    </current_conditions> 
    ... 
    <forecast_conditions> 
     <day_of_week data="Sat"/> 
     <low data="59"/> 
     <high data="75"/> 
     <icon data="/ig/images/weather/partly_cloudy.gif"/> 
     <condition data="Partly Cloudy"/> 
    </forecast_conditions> 
    </weather> 
</xml_api_reply>

After loading/parsing such a document, I would like to be able to access the information as simply as, say,

>>> xml['xml_api_reply']['weather']['forecast_information']['city'].data 
'Mountain View, CA'

or

>>> xml.xml_api_reply.weather.current_conditions.temp_f['data'] 
'68'

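From what I have seen so far, ElementTree seems to be the closest to what I dream of. But it is not quite there: there is still some fumbling to do when consuming XML. OTOH, what I have in mind is not that complicated - probably just a thin veneer on top of a parser - and yet it could reduce the annoyance of dealing with XML. Is there such magic? (And if not, why?)

PS. Note that I have already tried BeautifulSoup, and while I like its approach, it has real issues with empty <element/>s - see the examples in the comments below.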

2010-06-24 Nas Banov

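ElementTree is probably the best you can get without depending on third-party libraries. – carl 2010-06-24 00:27:57

I think lxml.objectify is the perfect solution for this problem. – shahjapan 2010-06-24 02:56:35

What you are looking for sounds a lot like Perl's XML::Simple CPAN module, which works well for a lot of routine XML work. Someone feeling industrious could probably implement something similar as an etree wrapper. – 2010-06-24 04:56:15

You want a thin veneer? That's easy to cook up. Try the following trivial wrapper around ElementTree as a start: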

# geetree.py
import xml.etree.ElementTree as ET


class GeeElem(object):
    """Wrapper around an ElementTree element. a['foo'] gets the
    attribute foo, a.foo gets the first subelement foo."""

    def __init__(self, elem):
        self.etElem = elem

    def __getitem__(self, name):
        # Item access maps to XML attributes, so a miss is a KeyError.
        res = self._getattr(name)
        if res is None:
            raise KeyError("No attribute named '%s'" % name)
        return res

    def __getattr__(self, name):
        # Attribute access maps to the first matching subelement.
        # Raising AttributeError keeps hasattr() and getattr() working.
        res = self._getelem(name)
        if res is None:
            raise AttributeError("No element named '%s'" % name)
        return res

    def _getelem(self, name):
        res = self.etElem.find(name)
        if res is None:
            return None
        return GeeElem(res)

    def _getattr(self, name):
        return self.etElem.get(name)


class GeeTree(object):
    "Wrapper around an ElementTree."

    def __init__(self, fname):
        self.doc = ET.parse(fname)

    def __getattr__(self, name):
        # Only the root element's tag is reachable from the tree itself.
        if self.doc.getroot().tag != name:
            raise AttributeError("No element named '%s'" % name)
        return GeeElem(self.doc.getroot())

    def getroot(self):
        return self.doc.getroot()

You invoke it like this:

>>> import geetree 
>>> t = geetree.GeeTree('foo.xml') 
>>> t.xml_api_reply.weather.forecast_information.city['data'] 
'Mountain View, CA' 
>>> t.xml_api_reply.weather.current_conditions.temp_f['data'] 
'68'

2010-06-24 01:53:17

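With this veneer, how do I get the text 'Blah' out of an XML element? I tried 'doc.a.b', but that does not give Blah... – Basj 2014-02-14 15:50:52

@Basj: This veneer does not support that. If you would like an answer, you can ask it as a separate question. Hint: you will have to change '__getattr__' to let you access either object attributes or subelements, and you will have to deal with the inevitable conflicts between the two (what if I have a method 'text' and a subelement 'text', for example?). This might give you some insight into why 'ElementTree' is not designed as simply as requested here... – 2014-03-16 19:45:12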
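
The hint in the comment above could be sketched roughly as follows. This is a hypothetical variant, not part of the geetree.py answer: the class name `TextElem`, the `text` property, and the `child()` method are all assumptions made for illustration. It relies on the fact that Python calls `__getattr__` only when normal attribute lookup fails, so real properties win over subelements, and a subelement with a colliding name stays reachable through an explicit method.

```python
import xml.etree.ElementTree as ET


class TextElem(object):
    """Hypothetical wrapper: a.foo is the first child <foo>, a['x'] is
    attribute x, and a.text is the element's own text content."""

    def __init__(self, elem):
        self._elem = elem

    @property
    def text(self):
        # A real property always wins over __getattr__, so a child
        # element named <text> is shadowed by this.
        return self._elem.text

    def child(self, name):
        # Explicit escape hatch for children whose names collide with
        # Python attributes such as 'text'.
        found = self._elem.find(name)
        if found is None:
            raise AttributeError("No element named '%s'" % name)
        return TextElem(found)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails. Private names
        # are rejected to avoid recursion if _elem is not set yet.
        if name.startswith('_'):
            raise AttributeError(name)
        return self.child(name)

    def __getitem__(self, name):
        value = self._elem.get(name)
        if value is None:
            raise KeyError(name)
        return value


# Made-up sample document, chosen to show the name collision.
doc = TextElem(ET.fromstring('<a><b>Blah</b><text>child</text></a>'))
print(doc.b.text)              # text content of <b>
print(doc.child('text').text)  # text content of the <text> child
```

The cost of this design is visible in the last line: once `text` means "my own text", the `<text>` child needs a second, less pretty access path.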

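Thanks @OwenS for your answer. I finally used http://stackoverflow.com/a/10077069/1422096, which is the best/simplest solution I found, at least among those I tried. – Basj 2014-03-16 20:21:34

If you don't mind using a third-party library, then BeautifulSoup will do almost exactly what you ask for: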

>>> from BeautifulSoup import BeautifulStoneSoup 
>>> soup = BeautifulStoneSoup('''<snip>''') 
>>> soup.xml_api_reply.weather.current_conditions.temp_f['data'] 
u'68'

2010-06-24 00:29:24

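I have already looked at Beautiful[Stone]Soup, but it is **broken** (as documented at http://www.crummy.com/software/BeautifulSoup/documentation.html) for empty tags, which is the case in this example. E.g. 'soup.xml_api_reply.weather.current_conditions.icon' returns ' ', or you can get 'temp_c' via 'soup.xml_api_reply.weather.current_conditions.condition.temp_f.temp_c['data']', which seems defective to me – 2010-06-24 00:43:37

It will work if 1) you know which empty tags you are looking for, and 2) they can be relied upon to be empty. Then you can specify them as an argument to the parser: selfClosingTags=['city', 'postal_code', ...] – 2010-06-24 01:10:18

@Owen S: yes, selfClosingTags does help with StoneSoup, but it should not really be necessary. This example is full of empty tags (which should have been attributes... but are not), and that is the case in a lot of XML – 2010-06-24 03:40:35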

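If you haven't already, I would suggest looking into the DOM API for Python. DOM is a pretty widely used XML interpretation system, so it should be quite robust.

It is probably a little more complicated than what you describe, but that comes from trying to preserve all the information implicit in XML markup rather than from bad design.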
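
For comparison, here is a minimal sketch of reading one value through the DOM, using the standard library's `xml.dom.minidom`. The XML is a trimmed copy of the question's sample; the variable names are assumptions for illustration.

```python
from xml.dom import minidom

XML = """<xml_api_reply version="1">
  <weather>
    <current_conditions>
      <condition data="Sunny"/>
      <temp_f data="68"/>
    </current_conditions>
  </weather>
</xml_api_reply>"""

dom = minidom.parseString(XML)

# getElementsByTagName searches all descendants and returns a NodeList.
temp_f = dom.getElementsByTagName("temp_f")[0]
temp_f_value = temp_f.getAttribute("data")
print(temp_f_value)

# Whitespace between tags is kept as text nodes, so child lists mix
# text nodes (nodeType 3) with element nodes (nodeType 1).
conditions = dom.getElementsByTagName("current_conditions")[0]
node_types = [child.nodeType for child in conditions.childNodes]
print(node_types)
```

The interleaved text nodes are one concrete case of the DOM preserving everything in the markup, and also one reason positional navigation through `childNodes` gets verbose.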

2010-06-24 00:30:35 tlayton

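This question is specifically about simple, pythonic XML access. I think the DOM is a lot of things (evil not among them), but 'easy' and 'pythonic' it most definitely is _not_. Reverting to the DOM to interact with XML is like dropping down to C (or worse, assembly) for a web application - it should be done rarely and only for a very good reason. – 2010-06-24 02:18:20

I would also point out that this is not because of preserving the XML structure; it is because the library strives to stick to a cross-language API for its interface. There are more Pythonic libraries that preserve XML structure quite accurately. – 2010-06-24 21:03:08

Take a look at Amara 2, particularly the Bindery part of this tutorial.

It works in a way pretty similar to what you describe.

On the other hand, ElementTree's find*() methods can get you 90% of the way there, and they ship with Python.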
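
As a rough illustration of the find*() route, the sketch below uses only the standard library. The XML is a trimmed copy of the question's sample; the second forecast_conditions block is made-up sample data, added only to show repeated elements.

```python
import xml.etree.ElementTree as ET

XML = """<xml_api_reply version="1">
  <weather>
    <forecast_information>
      <city data="Mountain View, CA"/>
    </forecast_information>
    <forecast_conditions>
      <day_of_week data="Sat"/>
      <low data="59"/>
      <high data="75"/>
    </forecast_conditions>
    <forecast_conditions>
      <!-- made-up values, for illustration only -->
      <day_of_week data="Sun"/>
      <low data="58"/>
      <high data="73"/>
    </forecast_conditions>
  </weather>
</xml_api_reply>"""

root = ET.fromstring(XML)  # root is the <xml_api_reply> element

# find() returns the first match for a path relative to the element.
city = root.find("weather/forecast_information/city").get("data")

# findall() returns every match, which covers repeated elements.
highs = [fc.find("high").get("data")
         for fc in root.findall("weather/forecast_conditions")]

# find() returns None when nothing matches, so guard before chaining.
missing = root.find("weather/no_such_element")

print(city, highs, missing)
```

Note that paths are relative to the element you call them on, so the root tag itself does not appear in the path.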

2010-06-24 00:35:37

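It does look like 'amara.bindery' does what I am looking for, but it seems too big (600k installer, 3MB of source) - as someone said, I wanted a banana but got a "free" gorilla with it. Re ElementTree find*(): it is close, but it lacks the pythonic []/iterator veneer I have in mind – 2010-06-24 03:17:44

I believe the built-in Python xml module will do the trick. Look at "xml.parsers.expat"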

xml.parsers.expat
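
To give a sense of what working with xml.parsers.expat involves, here is a minimal sketch. The handler name `start_element` and the `collected` dict are assumptions for illustration; the XML is a trimmed copy of the question's sample.

```python
import xml.parsers.expat

XML = b"""<xml_api_reply version="1">
  <weather>
    <current_conditions>
      <condition data="Sunny"/>
      <temp_f data="68"/>
    </current_conditions>
  </weather>
</xml_api_reply>"""

collected = {}


def start_element(name, attrs):
    # Called once per opening tag; attrs is a dict of its attributes.
    if "data" in attrs:
        collected[name] = attrs["data"]


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(XML, True)  # True marks this as the final chunk of input

print(collected["temp_f"])
```

The parser is callback-driven: it hands over tags as it meets them and keeps no tree, so any object-style access has to be built by the caller.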

2010-06-24 01:02:44 iform

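A low-level SAX-like parser interface provides a Pythonic object interface to a parsed XML document? What am I missing? – 2010-06-24 02:03:04

I highly recommend lxml.etree and XPath for parsing and analysing your data. Here is a complete example. I have truncated the XML to make it easier to read.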

import lxml.etree 

s = """<?xml version="1.0" encoding="utf-8"?> 
<xml_api_reply version="1"> 
    <weather module_id="0" tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0" > 
    <forecast_information> 
     <city data="Mountain View, CA"/> <forecast_date data="2010-06-23"/> 
    </forecast_information> 
    <forecast_conditions> 
     <day_of_week data="Sat"/> 
     <low data="59"/> 
     <high data="75"/> 
     <icon data="/ig/images/weather/partly_cloudy.gif"/> 
     <condition data="Partly Cloudy"/> 
    </forecast_conditions> 
    </weather> 
</xml_api_reply>""" 

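# lxml rejects str input that carries an encoding declaration, so pass bytes.
tree = lxml.etree.fromstring(s.encode("utf-8"))
for weather in tree.xpath('/xml_api_reply/weather'):
    # Attribute steps such as @data need xpath(); find() only returns elements.
    print(weather.xpath('forecast_information/city/@data')[0])
    print(weather.xpath('forecast_information/forecast_date/@data')[0])
    print(weather.xpath('forecast_conditions/low/@data')[0])
    print(weather.xpath('forecast_conditions/high/@data')[0])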

2010-06-24 01:20:49 Jerub

It does seem easy, I'll note, but it is more xpath-ish (xpathologic?) than pythonic. – 2010-06-29 16:57:59

lxml has already been mentioned. You could also check out lxml.objectify for some really simple manipulation.

>>> from lxml import objectify 
>>> tree = objectify.fromstring(your_xml) 
>>> tree.weather.attrib["module_id"] 
'0' 
>>> tree.weather.forecast_information.city.attrib["data"] 
'Mountain View, CA' 
>>> tree.weather.forecast_information.postal_code.attrib["data"] 
'94043'

2010-06-24 02:27:54

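++. Does what was asked, although, just as in the Amara case, a free gorilla (a ton of non-distribution libraries) comes along with the banana order. By the way, it seems one can also use '.get('data')' instead of '.attrib['data']' – 2010-06-29 19:48:36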
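
The two spellings mentioned in the comment above differ only when the attribute is missing. The sketch below shows this with the standard library's xml.etree.ElementTree rather than lxml, on the assumption that lxml elements follow the same `get()` and `attrib` behaviour; the attribute name `no_such_attr` is hypothetical.

```python
import xml.etree.ElementTree as ET

city = ET.fromstring('<city data="Mountain View, CA"/>')

# For an attribute that exists, both spellings return the same string.
via_attrib = city.attrib["data"]
via_get = city.get("data")

# For a missing attribute, get() returns None (or a supplied default),
# while attrib[...] raises KeyError like any dict lookup.
missing = city.get("no_such_attr")
fallback = city.get("no_such_attr", "n/a")
try:
    city.attrib["no_such_attr"]
    raised = False
except KeyError:
    raised = True

print(via_attrib, via_get, missing, fallback, raised)
```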

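The suds project provides a web services client library that works almost exactly as you describe: give it a WSDL and then use factory methods to create the defined types (it processes the responses too!).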

2010-06-24 03:01:17

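OK, that's interesting... but doesn't the name **suds** imply it is meant to be used only with **SOAP**? The example above is not SOAPy, and I don't want to go down the slippery SOAP slope. Also, where would I find the WSDL for, say, the weather service above? – 2010-06-24 03:28:21

Yes, you're right - suds is definitely targeted at SOAP web services, not generic XML. A WSDL would be the published contract for such a service. My bad, I assumed you had washed the bubbles off the example before posting :-) – 2010-06-24 12:46:41

I found the following python-simplexml module, which, in the author's attempt to get something close to SimpleXML from PHP, is indeed a small wrapper around ElementTree. It is less than 100 lines but seems to do what was requested: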

>>> import urllib  # Python 2 session: urllib.urlopen does not exist in Python 3
>>> import SimpleXml
>>> x = SimpleXml.parse(urllib.urlopen('http://www.google.com/ig/api?weather=94043'))
>>> print x.weather.current_conditions.temp_f['data']
58

2010-06-24 05:08:22
