2016-11-24 46 views

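What should I replace xml.dom.minidom with to get something that can be pickled efficiently?

I have an application that works with XML files of about 1-2 megabytes. That doesn't sound like much, but I have run into a performance problem nonetheless.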

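I have some compute-bound tasks that I would like to speed up, so I tried using multiprocessing.Pool.imap, which requires pickling this XML data. Pickling the data structures that contain references into this DOM turns out to be slower than the compute-bound work itself, and the culprit seems to be recursion: I had to raise the recursion limit to 10,000 just to get pickle to work at all :-S.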
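The recursion depth is plausibly driven by how the nodes link to one another rather than by file size: pickle walks the object graph depth-first, and minidom nodes reference their siblings and parent, so a long run of siblings can turn into a long chain of nested calls. A minimal sketch of that effect follows. `LinkedNode`, `build_chain`, and `can_pickle` are hypothetical names used for illustration only; this is not minidom itself.

```python
import pickle


class LinkedNode(object):
    """Hypothetical stand-in for a DOM node that points at its next sibling."""

    def __init__(self, value):
        self.value = value
        self.next_sibling = None


def build_chain(length):
    """Build `length` nodes, each one referencing the next."""
    head = LinkedNode(0)
    node = head
    for i in range(1, length):
        node.next_sibling = LinkedNode(i)
        node = node.next_sibling
    return head


def can_pickle(obj):
    """Return True if pickling succeeds, False if it runs out of recursion depth."""
    try:
        pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        return True
    except RecursionError:
        return False


if __name__ == "__main__":
    chained = build_chain(100000)
    flat = tuple(range(100000))  # same number of values, no object-to-object links
    print("chain pickles:", can_pickle(chained))
    print("flat tuple pickles:", can_pickle(flat))
```

If the chained case is what fails, removing sibling and parent references from the pickled structure should matter more than raising the recursion limit.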

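Anyway, my question is:

If I wanted to attack this problem from the referential performance angle, what should I replace minidom with? The criteria are both pickling performance and ease of transition.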

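To give you an idea of which methods are needed, I have pasted a wrapper class (written a while ago to speed up getElementsByTagName calls). It would be acceptable to replace all minidom nodes with nodes that adhere to the same interface as this class, i.e. I don't need all the methods from minidom. Getting rid of the parentNode property would also be acceptable (and probably a good idea, to improve pickling performance).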

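And yes, if I were designing this today I would not go for XML node references in the first place, but ripping all of that out now would be a lot of work, so I hope this can be patched instead.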

Should I just write the damn thing myself using Python built-ins or the collections library?

class ImmutableDOMNode(object):
    """Read-only wrapper around a minidom node that caches getElementsByTagName."""

    def __init__(self, node):
        self.node = node
        self.cachedElementsByTagName = {}

    @property
    def nodeType(self):
        return self.node.nodeType

    @property
    def tagName(self):
        return self.node.tagName

    @property
    def ownerDocument(self):
        return self.node.ownerDocument

    @property
    def nodeName(self):
        return self.node.nodeName

    @property
    def nodeValue(self):
        return self.node.nodeValue

    @property
    def attributes(self):
        return self.node.attributes

    @property
    def parentNode(self):
        return ImmutableDOMNode(self.node.parentNode)

    @property
    def firstChild(self):
        return ImmutableDOMNode(self.node.firstChild)

    @property
    def childNodes(self):
        return [ImmutableDOMNode(node) for node in self.node.childNodes]

    def getElementsByTagName(self, name):
        # Return the cached list if this tag name was looked up before.
        result = self.cachedElementsByTagName.get(name)
        if result is not None:
            return result
        uncachedResult = self.node.getElementsByTagName(name)
        cachedResult = [ImmutableDOMNode(node) for node in uncachedResult]
        self.cachedElementsByTagName[name] = cachedResult
        return cachedResult

    def getAttribute(self, qName):
        return self.node.getAttribute(qName)

    def toxml(self, encoding=None):
        return self.node.toxml(encoding)

    def toprettyxml(self, indent="", newl="", encoding=None):
        return self.node.toprettyxml(indent, newl, encoding)

    # Mutating methods are rejected because the wrapper is meant to be immutable.
    def appendChild(self, node):
        raise Exception("cannot append child to immutable node")

    def removeChild(self, node):
        raise Exception("cannot remove child from immutable node")

    def cloneNode(self, deep):
        raise Exception("clone node not implemented")

    def createElement(self, tagName):
        raise Exception("cannot create element for immutable node")

    def createTextNode(self, tagName):
        raise Exception("cannot create text node for immutable node")

    def createAttribute(self, qName):
        raise Exception("cannot create attribute for immutable node")
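As one reading of the "built-ins or the collections library" option raised above, here is a minimal sketch that copies a minidom tree into nested `collections.namedtuple` records without parent or sibling references. `PlainNode`, `to_plain`, and `elements_by_tag_name` are hypothetical names, and the field set is an assumption based on the properties the wrapper class exposes; it is not a drop-in replacement for the full interface.

```python
import pickle
import xml.dom.minidom
from collections import namedtuple

# Hypothetical record type: no parent or sibling references, so no cycles.
PlainNode = namedtuple(
    "PlainNode", "nodeType tagName nodeValue attributes childNodes"
)


def to_plain(node):
    """Recursively copy a minidom node into nested PlainNode tuples."""
    # Non-element nodes have attributes set to None; elements expose a map.
    attributes = dict(node.attributes.items()) if node.attributes else {}
    return PlainNode(
        nodeType=node.nodeType,
        tagName=getattr(node, "tagName", None),  # text nodes have no tagName
        nodeValue=node.nodeValue,
        attributes=attributes,
        childNodes=tuple(to_plain(child) for child in node.childNodes),
    )


def elements_by_tag_name(node, name):
    """Return descendant elements called `name`, in document order."""
    found = []
    for child in node.childNodes:
        if child.tagName == name:
            found.append(child)
        found.extend(elements_by_tag_name(child, name))
    return found


if __name__ == "__main__":
    doc = xml.dom.minidom.parseString(
        '<root><item id="1">a</item><item id="2">b</item></root>'
    )
    plain = to_plain(doc.documentElement)
    restored = pickle.loads(pickle.dumps(plain, protocol=pickle.HIGHEST_PROTOCOL))
    print(restored == plain)
    print([n.attributes["id"] for n in elements_by_tag_name(restored, "item")])
```

With this layout the pickling recursion depth follows the nesting depth of the XML rather than the number of siblings, at the cost of losing parentNode navigation.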

Answer


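So I decided to just write my own DOM implementation that meets my requirements. I have pasted the code below in case it helps someone. It depends on lru_cache from "memoization library for python 2.7" and on @Raymond Hettinger's immutable dictionary from "Immutable dictionary, only use as a key for another dictionary". However, these dependencies are easy to remove if you don't mind lower safety/performance.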

import copy
import xml.dom.minidom
from collections import defaultdict
from functools import lru_cache  # Python 3; the answer itself used a Python 2.7 backport

# ImmutableDict is the immutable-dictionary recipe mentioned above and is not
# reproduced here. As noted, a plain dict can stand in (ImmutableDict = dict)
# if you accept the loss of immutability.


class CycleFreeDOMNode(object):
    """DOM-like node without parent or sibling references, so pickling sees no cycles."""

    def __init__(self, minidomNode=None):
        if minidomNode is None:
            # Used by cloneNode, which fills in the fields itself.
            return
        if not isinstance(minidomNode, xml.dom.minidom.Node):
            raise ValueError("%s needs to be instantiated with a minidom.Node" % (
                type(self).__name__
            ))
        if minidomNode.nodeValue and minidomNode.childNodes:
            raise ValueError(
                "both nodeValue and childNodes in same node are not supported"
            )
        # Text and document nodes have no tagName.
        self._tagName = getattr(minidomNode, "tagName", None)
        self._nodeType = minidomNode.nodeType
        self._nodeName = minidomNode.nodeName
        self._nodeValue = minidomNode.nodeValue
        self._attributes = (
            dict(minidomNode.attributes.items()) if minidomNode.attributes else {}
        )
        self._childNodes = tuple(
            CycleFreeDOMNode(cn)
            for cn in minidomNode.childNodes
        )
        # Index the direct children by tag name (text nodes end up under None).
        childNodesByTagName = defaultdict(list)
        for cn in self._childNodes:
            childNodesByTagName[cn.tagName].append(cn)
        self._childNodesByTagName = ImmutableDict(childNodesByTagName)

    @property
    def nodeType(self):
        return self._nodeType

    @property
    def tagName(self):
        return self._tagName

    @property
    def nodeName(self):
        return self._nodeName

    @property
    def nodeValue(self):
        return self._nodeValue

    @property
    def attributes(self):
        return self._attributes

    @property
    def firstChild(self):
        return self._childNodes[0] if self._childNodes else None

    @property
    def childNodes(self):
        return self._childNodes

    @lru_cache(maxsize=100)
    def getElementsByTagName(self, name):
        # Copy the indexed list so that extending the result does not modify the index.
        # Direct children come first, then deeper matches, which is not
        # minidom's document order.
        result = list(self._childNodesByTagName.get(name, []))
        for cn in self.childNodes:
            result += cn.getElementsByTagName(name)
        return result

    def cloneNode(self, deep=False):
        clone = CycleFreeDOMNode()
        clone._tagName = self._tagName
        clone._nodeType = self._nodeType
        clone._nodeName = self._nodeName
        clone._nodeValue = self._nodeValue
        clone._attributes = copy.copy(self._attributes)
        if deep:
            clone._childNodes = tuple(
                cn.cloneNode(deep)
                for cn in self.childNodes
            )
            childNodesByTagName = defaultdict(list)
            for cn in clone._childNodes:
                childNodesByTagName[cn.tagName].append(cn)
            clone._childNodesByTagName = ImmutableDict(childNodesByTagName)
        else:
            # A shallow clone shares the child nodes with the original.
            clone._childNodes = tuple(self.childNodes)
            clone._childNodesByTagName = self._childNodesByTagName
        return clone

    def toxml(self):
        # Text and attribute values are written without XML escaping.
        def makeXMLForContent():
            return self.nodeValue or "".join([
                cn.toxml() for cn in self.childNodes
            ])

        if not self.tagName:
            return makeXMLForContent()
        return "<%s%s>%s</%s>" % (
            self.tagName,
            # Attributes are separated by spaces, not commas.
            " " + " ".join([
                "%s=\"%s\"" % (k, v)
                for k, v in self.attributes.items()
            ]) if self.attributes else "",
            makeXMLForContent(),
            self.tagName
        )

    def getAttribute(self, name):
        return self._attributes.get(name, "")

    def setAttribute(self, name, value):
        self._attributes[name] = value
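The answer says its two dependencies are easy to remove. One way this might look for the memoization part is a per-instance dict that is left out of the pickled state, so the cache never adds to the pickling cost. This is a sketch under assumptions, not the answer's code: `CachedLookup`, `get_elements_by_tag_name`, and the `computations` counter are hypothetical names for illustration.

```python
import pickle


class CachedLookup(object):
    """Hypothetical node-like class that memoizes a lookup in a per-instance dict."""

    def __init__(self, children_by_tag):
        # Plain dict of tag name -> values; stands in for the real child index.
        self._children_by_tag = dict(children_by_tag)
        self._cache = {}
        self.computations = 0  # counts cache misses, for demonstration only

    def get_elements_by_tag_name(self, name):
        try:
            return self._cache[name]
        except KeyError:
            self.computations += 1
            result = tuple(self._children_by_tag.get(name, ()))
            self._cache[name] = result
            return result

    def __getstate__(self):
        # Leave the cache contents out of the pickle; they are rebuilt on demand.
        state = self.__dict__.copy()
        state["_cache"] = {}
        return state


if __name__ == "__main__":
    node = CachedLookup({"item": ["a", "b"]})
    node.get_elements_by_tag_name("item")
    node.get_elements_by_tag_name("item")  # served from the cache
    restored = pickle.loads(pickle.dumps(node))
    print(node.computations, restored._cache)
```

Unlike a class-level lru_cache, this cache lives and dies with the node it belongs to, and a worker process simply starts with an empty one.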