Conditional removal of elements

My task is a small refactoring of some elements of an XML tree in Python 3, namely replacing the following structure:

<span class="nobr"> 
<a href="http://www.google.com/"> 
    http://www.google.com/ 
    <sup> 
    <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/> 
    </sup> 
</a> 
</span>

with:

<span class="nobr"> 
<a href="http://www.google.com/"> 
    http://www.google.com/ 
</a> 
</span>

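That is, remove the sup element only if the whole structure exactly matches the one given in the first example. I need to keep working with the XML document during processing, so regular-expression matching is not an option.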

I already have code that works for my purposes:

doc = self.__refactor_links(doc) 
... 
def __refactor_links(self, node):
    """Recursively seeks for links to refactor them"""
    for span in node.childNodes:
        replace = False
        if isinstance(span, xml.dom.minidom.Element):
            if span.tagName == "span" and span.getAttribute("class") == "nobr":
                if span.childNodes.length == 1:
                    a = span.childNodes.item(0)
                    if isinstance(a, xml.dom.minidom.Element):
                        if a.tagName == "a" and a.getAttribute("href"):
                            if a.childNodes.length == 2:
                                aurl = a.childNodes.item(0)
                                if isinstance(aurl, xml.dom.minidom.Text):
                                    sup = a.childNodes.item(1)
                                    if isinstance(sup, xml.dom.minidom.Element):
                                        if sup.tagName == "sup":
                                            if sup.childNodes.length == 1:
                                                img = sup.childNodes.item(0)
                                                if isinstance(img, xml.dom.minidom.Element):
                                                    if img.tagName == "img" and img.getAttribute("class") == "rendericon":
                                                        replace = True
            else:
                self.__refactor_links(span)
        if replace:
            a.removeChild(sup)
    return node

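This version does not run recursively through all the tags: once it finds something similar to the structure it is looking for, it does not continue searching inside those elements, even if the match later fails. In my case I do not need that (it would be nice to have, but adding a bunch of "else: self.__refactor_links(tag)" branches kills it in my eyes).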

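If any condition fails, the removal must not happen. Is there a cleaner way to define a set of conditions that avoids the pile of 'ifs'? Some custom data structure could be used to store the conditions, e.g. ('sup', ('img', (...))), but I don't know how it should be processed. If you have any suggestions or examples in Python, please help.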

Thanks.

2010-11-11 DarkPhoenix

Ouch. 'import this': '... Flat is better than nested. ...' – 2010-11-12 00:38:23
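One way to read the "flat is better than nested" remark: each nested if in the question's minidom code can become an early return. The following is only a sketch under assumptions. The helper names is_element and find_icon_sup are made up for illustration, and the sample document is written without whitespace between tags, because whitespace text nodes would make the child-count checks fail (the same is true of the original code).

```python
from xml.dom import minidom
from xml.dom.minidom import Element, Text


def is_element(node, tag):
    """True if node is a DOM element with the given tag name."""
    return isinstance(node, Element) and node.tagName == tag


def find_icon_sup(span):
    """Return the <sup> to remove, or None if the structure does not match."""
    if not is_element(span, "span") or span.getAttribute("class") != "nobr":
        return None
    if span.childNodes.length != 1:
        return None
    a = span.childNodes.item(0)
    if not is_element(a, "a") or not a.getAttribute("href"):
        return None
    if a.childNodes.length != 2:
        return None
    text, sup = a.childNodes.item(0), a.childNodes.item(1)
    if not isinstance(text, Text) or not is_element(sup, "sup"):
        return None
    if sup.childNodes.length != 1:
        return None
    img = sup.childNodes.item(0)
    if not is_element(img, "img") or img.getAttribute("class") != "rendericon":
        return None
    return sup


# Compact sample: no whitespace between tags (see the note above).
doc = minidom.parseString(
    '<span class="nobr"><a href="http://www.google.com/">'
    'http://www.google.com/<sup><img class="rendericon"/></sup></a></span>'
)
sup = find_icon_sup(doc.documentElement)
if sup is not None:
    sup.parentNode.removeChild(sup)
print(doc.documentElement.toxml())
```

The checks are the same as in the question; only the control flow changes, so any failed condition still means nothing is removed.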

This is definitely a job for XPath expressions, in your case probably used together with lxml.

The XPath could be something along the lines of:

//span[@class="nobr"]/a[@href]/sup[img/@class="rendericon"]

Query the tree with this XPath expression and remove all matching elements. No if constructs or recursion are necessary.

2010-11-11 21:36:52 stefanw

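Thanks for pointing out XPath, I had never used it. I rewrote everything to use xml.etree.ElementTree instead of xml.dom.minidom. ElementTree 1.3 supports all the XPath features I need (http://effbot.org/zone/element-xpath.htm), so I had to switch to Python 3.2 (3.1, the current stable version, ships 1.2.6). – DarkPhoenix 2010-11-12 14:58:17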
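For reference, a rewrite on top of xml.etree.ElementTree might look roughly like the sketch below. This is not the code from the comment above, which was never posted. It assumes the limited XPath subset of ElementTree, where a predicate such as sup[img/@class="rendericon"] is not available, so the img check is done in a second step. The function name strip_render_icons and the test wrapper element are made up for the example.

```python
import xml.etree.ElementTree as ET

# Wrapped in <test> because ".//span" only searches below the start element.
SAMPLE = """<test>
  <span class="nobr"><a href="http://www.google.com/">http://www.google.com/<sup><img class="rendericon" src="http://jira.atlassian.com/icon.gif"/></sup></a></span>
  <span class="other"><a href="http://www.google.com/">kept<sup><img class="rendericon"/></sup></a></span>
</test>"""


def strip_render_icons(root):
    """Remove <sup> children of span.nobr/a[@href] that hold an img.rendericon.

    Returns the number of removed <sup> elements.
    """
    removed = 0
    # ElementTree elements have no parent pointer, so select the parent <a>
    # and remove the matching <sup> children from it.
    for a in root.findall(".//span[@class='nobr']/a[@href]"):
        for sup in a.findall("sup"):
            if sup.find("img[@class='rendericon']") is not None:
                a.remove(sup)  # note: this also drops sup.tail
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
count = strip_render_icons(root)
print(count)
print(ET.tostring(root, encoding="unicode"))
```

Like the XPath suggested in the answer, this is a loose match: it does not check for extra attributes or extra children.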

I'm not good with XML, but can't you simply look the nodes up and remove them from their parents?

>>> from xml.dom.minidom import parse, parseString 
>>> dom = parseString(x) 
>>> k = dom.getElementsByTagName('sup') 
>>> for l in k: 
...  p = l.parentNode 
...  p.removeChild(l) 
... 
<DOM Element: sup at 0x100587d40> 
>>> 
>>> print(dom.toxml())
<?xml version="1.0" ?><span class="nobr"> 
<a href="http://www.google.com/"> 
    http://www.google.com/ 

</a> 
</span> 
>>>

2010-11-11 21:37:09 pyfunc

Here is something quick with lxml's find/search. I strongly recommend XPath.

>>> from lxml import etree 
>>> doc = etree.XML("""<span class="nobr"> 
... <a href="http://www.google.com/"> 
... http://www.google.com/ 
... <sup> 
... <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/> 
... </sup> 
... </a> 
... </span>""") 
>>> for a in doc.xpath('//span[@class="nobr"]/a[@href="http://www.google.com/"]'): 
...  for sub in list(a): 
...   a.remove(sub) 
... 
>>> print(etree.tostring(doc, pretty_print=True, encoding="unicode"))
<span class="nobr"> 
<a href="http://www.google.com/"> 
    http://www.google.com/ 
    </a> 
</span>

2010-11-11 21:47:34 MattH

This is easy to do with lxml and XSLT:

>>> from lxml import etree 
>>> from io import StringIO
>>> # create the stylesheet 
>>> xslt = StringIO(""" 
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 
    <!-- this is the standard identity transform --> 
    <xsl:template match="@* | node()"> 
    <xsl:copy> 
     <xsl:apply-templates select="@* | node()"/> 
    </xsl:copy> 
    </xsl:template> 
    <!-- this replaces the specific node you're looking to replace --> 
    <xsl:template match="span[a[@href='http://www.google.com/' and
        sup[img[ 
         @align='absmiddle' and 
         @border='0' and 
         @class='rendericon' and 
         @height='7' and 
         @src='http://jira.atlassian.com/icon.gif' and 
         @width='7']]]]"> 
    <span class="nobr"> 
     <a href="http://www.google.com/">http://www.google.com/</a> 
    </span> 
    </xsl:template> 
</xsl:stylesheet>""") 
>>> # create a transform function from the XSLT stylesheet 
>>> transform = etree.XSLT(etree.parse(xslt)) 
>>> # here's a sample source XML instance for testing 
>>> source = StringIO(""" 
<test> 
    <span class="nobr"> 
    <a href="http://www.google.com/"> 
    http://www.google.com/ 
    <sup> 
    <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/> 
    </sup> 
    </a> 
    </span> 
</test>""") 
>>> # parse the source, transform it to an XSLT result tree, and print the result 
>>> print(etree.tostring(transform(etree.parse(source)), encoding="unicode"))
<test> 
    <span class="nobr"><a href="http://www.google.com/">http://www.google.com/</a></span> 
</test>

Edit:

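I should point out that none of the answers (not mine, not MattH's, and certainly not the example the OP posted) does what the OP asked for, which is to replace only the elements whose structure exactly matches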

<span class="nobr"> 
    <a href="http://www.google.com/"> 
    http://www.google.com/ 
    <sup> 
    <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/> 
    </sup> 
    </a> 
</span>

For example, all of these examples would replace the sup if the img had a style attribute, or if the sup had another child besides the img.

The XPath expression can be made much stricter. For example, instead of using

span[a]

which matches any span with at least one a child, you can use

span[count(@*)=0 and count(*)=1 and a]

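which matches any span that has no attributes and exactly one child element, where that child is an a. You can go pretty crazy with this in your pursuit of precision, for example: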

span[count(@*) = 1 and 
    @class='nobr' and 
    count(*) = 1 and 
    a[count(@*) = 1 and 
     @href='http://www.google.com/' and 
     count(*) = 1 and 
     sup[count(@*) = 0 and 
      count(*) = 1 and 
      img[count(*) = 0 and 
       count(@*) = 7 and 
       @align='absmiddle' and 
       @alt='' and 
       @border='0' and 
       @class='rendericon' and 
       @height='7' and 
       @src='http://jira.atlassian.com/icon.gif' and 
       @width='7']]]]

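which, at each step of the match, makes sure that the matched element contains exactly the specified attributes and elements and nothing more. (It still does not verify that they contain no text, comments, or processing instructions; if you are really serious about it, use count(node()) everywhere count(*) is used.)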
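If XPath with count() is not available (the ElementTree subset does not have it), roughly the same strictness can be expressed in plain Python with a nested pattern, close to the data structure the question hints at. This is a sketch, not XPath: the (tag, attributes, children) tuple layout and the name matches_exactly are invented for the example, and like the expression above it ignores text content.

```python
import xml.etree.ElementTree as ET

IMG_ATTRS = {
    "align": "absmiddle", "alt": "", "border": "0", "class": "rendericon",
    "height": "7", "src": "http://jira.atlassian.com/icon.gif", "width": "7",
}
# Hypothetical pattern layout: (tag, exact attributes, child patterns).
PATTERN = ("span", {"class": "nobr"}, [
    ("a", {"href": "http://www.google.com/"}, [
        ("sup", {}, [
            ("img", IMG_ATTRS, []),
        ]),
    ]),
])


def matches_exactly(elem, pattern):
    """True if elem has exactly the given tag, attributes and child elements."""
    tag, attrs, children = pattern
    # Exact attribute set: plays the role of count(@*) plus the @name='value' tests.
    if elem.tag != tag or dict(elem.attrib) != attrs:
        return False
    kids = list(elem)
    # Exact number of child elements: plays the role of count(*).
    if len(kids) != len(children):
        return False
    return all(matches_exactly(k, p) for k, p in zip(kids, children))


TEMPLATE = ('<span class="nobr"><a href="http://www.google.com/">'
            'http://www.google.com/<sup><img {extra}align="absmiddle" alt="" '
            'border="0" class="rendericon" height="7" '
            'src="http://jira.atlassian.com/icon.gif" width="7"/></sup></a></span>')

exact = ET.fromstring(TEMPLATE.format(extra=""))
styled = ET.fromstring(TEMPLATE.format(extra='style="x" '))
print(matches_exactly(exact, PATTERN))   # structure from the question
print(matches_exactly(styled, PATTERN))  # img carries an extra style attribute
```

When the match succeeds, the sup can be removed from its parent a; the extra style attribute on the second sample is enough to make the match fail.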

2010-11-12 00:29:13
