2011-05-13 56 views

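Search/replace content of XML

I have been successfully using xml.etree.ElementTree to parse an XML file, search for content, and then write it to a different XML file. However, so far I have only worked with the text inside a single tag.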

import os, glob, xml.etree.ElementTree as ET

path = r"G:\63D RRC GIS Data\metadata\general\2010_contract"
for fn in os.listdir(path):
    filepaths = glob.glob(path + os.sep + fn + os.sep + "*overall.xml")
    for filepath in filepaths:
        (pa, filename) = os.path.split(filepath)
        # Grab element text from the old, archived metadata file; this text
        # then gets put into the current, working .xml.
        root = ET.parse(pa + os.sep + "archive" + os.sep + "base_metadata_overall.xml").getroot()

        for item in root.iter("abstract"):
            correct_abstract = item.text

        root2 = ET.parse(pa + os.sep + "base_metadata_overall.xml").getroot()

        # Swap the <abstract> child of every <descript> for the archived text.
        for item in root2.iter("descript"):
            old_abstract = item.find("abstract")
            if old_abstract is not None:
                item.remove(old_abstract)
            new_symbol_abstract = ET.SubElement(item, "abstract")
            new_symbol_abstract.text = correct_abstract
        tree = ET.ElementTree(root2)
        tree.write(pa + os.sep + "base_metadata_overall.xml")
        print("created --- " + filename + " metadata")

But now I need to:

1) Search an XML file and grab everything between the "attr" tags. Here is an example:

<attr><attrlabl Sync="TRUE">OBJECTID</attrlabl><attalias Sync="TRUE">ObjectIdentifier</attalias><attrtype Sync="TRUE">OID</attrtype><attwidth Sync="TRUE">4</attwidth><atprecis Sync="TRUE">0</atprecis><attscale Sync="TRUE">0</attscale><attrdef Sync="TRUE">Internal feature number.</attrdef></attr> 

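2) Open a different XML file, find everything between the same "attr" tags, and replace it with the content from above.

Basically the same thing I was doing before, except ignoring the child elements, attributes, and so on between the "attr" tags and treating it all as text.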
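As a rough illustration of step 1, each "attr" element can be serialized whole, children and attributes included, instead of being picked apart tag by tag. This is only a sketch using the standard-library xml.etree.ElementTree; the `SAMPLE` document and the `attr_blocks` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the shape of the <attr> block shown above.
SAMPLE = (
    '<eainfo><detailed>'
    '<attr><attrlabl Sync="TRUE">OBJECTID</attrlabl>'
    '<attrtype Sync="TRUE">OID</attrtype></attr>'
    '</detailed></eainfo>'
)


def attr_blocks(xml_text):
    """Return every <attr> element, serialized with its children, as text."""
    root = ET.fromstring(xml_text)
    return [ET.tostring(attr, encoding="unicode") for attr in root.iter("attr")]


blocks = attr_blocks(SAMPLE)
print(blocks[0])
```

Working with the parsed elements rather than the serialized text is usually easier for step 2, since an element can be copied straight into another tree.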

Thanks!!

Please bear with me, posting on this forum works a bit differently from what I'm used to!

Here is what I have so far:

import os, glob, re
from lxml import etree

path = r"C:\temp\python\xml"
for fn in os.listdir(path):
    filepaths = glob.glob(path + os.sep + fn + os.sep + "*overall.xml")
    for filepath in filepaths:
        (pa, filename) = os.path.split(filepath)

        # Pull the correct <attr> blocks out of attributes.xml as text.
        xml = open(pa + os.sep + "attributes.xml")
        xmltext = xml.read()
        xml.close()
        correct_attrs = re.findall("<attr>(.*?)</attr>", xmltext, re.DOTALL)

        # Parse the target file and swap each <attr>, in document order,
        # for the matching block from attributes.xml.
        old = etree.parse(pa + os.sep + "base_metadata_overall.xml").getroot()
        for attr, item in zip(old.xpath('//attr'), correct_attrs):
            replacement = etree.fromstring("<attr>" + item + "</attr>")
            attr.getparent().replace(attr, replacement)
        print(etree.tostring(old))

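Got this working, see below, and even figured out how to export to a new .xml. However, if the number of attr's differs between source and destination, I get the following error. Any suggestions?

node = replacements.pop()

IndexError: pop from empty list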

import os, glob, copy
from lxml import etree

path = r"C:\temp\python\xml"
for fn in os.listdir(path):
    filepaths = glob.glob(path + os.sep + fn + os.sep + "*overall.xml")
    for filepath in filepaths:
        (pa, filename) = os.path.split(filepath)
        # etree.parse() reads the file itself and honors its encoding declaration.
        source = etree.parse(pa + os.sep + "attributes.xml").getroot()
        dest = etree.parse(pa + os.sep + "base_metadata_overall.xml").getroot()

        # Reverse so that pop() hands back the source <attr> nodes in document order.
        replacements = source.xpath('//attr')
        replacements.reverse()

        for attr in dest.xpath('//attr'):
            # Raises IndexError when dest has more <attr> nodes than source.
            node = replacements.pop()
            attr.getparent().replace(attr, copy.deepcopy(node))
        #print(etree.tostring(dest))
        tree = etree.ElementTree(dest)
        tree.write(pa + os.sep + "edited_metadata.xml")
        print(fn + "--- successfully edited")
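The IndexError can be reproduced in isolation with two small inline documents. The sketch below is an assumption-laden stand-in, not the script above: it uses the standard-library xml.etree.ElementTree instead of lxml, so a child-to-parent dictionary replaces getparent(), and `SOURCE_XML`, `DEST_XML`, `replace_one_to_one`, and `demo_error` are names invented for the example.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical inline documents: the source has 2 <attr>, the destination has 3.
SOURCE_XML = "<root><attr><a/></attr><attr><b/></attr></root>"
DEST_XML = "<tree><attr><x/></attr><attr><y/></attr><attr><z/></attr></tree>"


def replace_one_to_one(source, dest):
    """Replace each <attr> in dest with the next <attr> from source."""
    replacements = list(source.iter("attr"))
    replacements.reverse()
    # ElementTree has no getparent(), so build a child -> parent lookup.
    parent_of = {child: parent for parent in dest.iter() for child in parent}
    for attr in list(dest.iter("attr")):
        node = replacements.pop()  # fails once the source nodes run out
        parent = parent_of[attr]
        index = list(parent).index(attr)
        parent.remove(attr)
        parent.insert(index, copy.deepcopy(node))


def demo_error():
    """Return the error message raised by the mismatched replacement."""
    source = ET.fromstring(SOURCE_XML)
    dest = ET.fromstring(DEST_XML)
    try:
        replace_one_to_one(source, dest)
    except IndexError as exc:
        return str(exc)
    return None


print(demo_error())
```

The third destination "attr" has no partner in the source, so the pop() call on the exhausted list is what fails.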

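Update 2011/5/16: Reworked a few things to fix the "IndexError: pop from empty list" error mentioned above. I realized that replacing the "attr" tags is not always a 1-to-1 replacement. E.g. sometimes the source .xml has 20 attr's and the destination .xml has 25. In that case a 1-to-1 replacement chokes.

Anyway, the code below removes all the attr's from the destination and then puts in the source attr's. It also checks for another tag, "subtype"; if any exist, it adds them after the attr's, but still inside the "detailed" tag.

Thanks again to everyone who helped.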

import os, glob, copy
from lxml import etree

path = r"G:\63D RRC GIS Data\metadata\general\2010_contract"
#path = r"C:\temp\python\xml"
for fn in os.listdir(path):
    correct_title = fn.replace('_', ' ') + " various facilities"
    correct_fc_name = fn.replace('_', ' ')
    filepaths = glob.glob(path + os.sep + fn + os.sep + "*overall.xml")
    for filepath in filepaths:
        print("-----" + fn + "-----")
        (pa, filename) = os.path.split(filepath)
        # etree.parse() reads the file itself and honors its encoding declaration.
        source = etree.parse(pa + os.sep + "attributes.xml").getroot()
        dest = etree.parse(pa + os.sep + "base_metadata_overall.xml").getroot()
        replacements = source.xpath('//attr')
        replacesubtypes = source.xpath('//subtype')

        # Remove every existing <attr> from the destination.
        for n in dest.xpath('//attr'):
            n.getparent().remove(n)
            print(n.tag + " removed")

        for n2 in dest.xpath('//detailed'):
            # Insert after the first child of <detailed>, keeping source order.
            pos = 1
            for realatrs in replacements:
                # deepcopy: inserting the node itself would move it out of source
                n2.insert(pos, copy.deepcopy(realatrs))
                pos += 1
            print("attr's replaced")
            if replacesubtypes:
                # The subtypes go right after the attr's, still inside <detailed>.
                for realsubtypes in replacesubtypes:
                    n2.insert(pos, copy.deepcopy(realsubtypes))
                    pos += 1
                print("subtype's replaced")

        tree = etree.ElementTree(dest)
        tree.write(pa + os.sep + "base_metadata_overall_v2.xml")
        print(fn + "--- successfully edited")
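The remove-everything-then-insert idea can be tried on small inline documents before pointing it at real metadata. This is a minimal sketch, assuming the standard-library xml.etree.ElementTree rather than lxml, assuming the "attr" elements are direct children of "detailed", and assuming the first child of "detailed" (called `enttyp` here purely for illustration) should stay first. The helper name `replace_all_attrs` is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def replace_all_attrs(source, dest):
    """Drop every <attr> child of each <detailed> in dest, then insert copies
    of the source <attr> and <subtype> elements in their original order."""
    new_nodes = list(source.iter("attr")) + list(source.iter("subtype"))
    for detailed in dest.iter("detailed"):
        for old in detailed.findall("attr"):
            detailed.remove(old)
        # Assumption: index 0 holds an element that must remain first.
        for offset, node in enumerate(new_nodes):
            detailed.insert(1 + offset, copy.deepcopy(node))
    return dest


# Source has 2 attr's plus a subtype; destination has 3 attr's.
source = ET.fromstring(
    "<m><detailed><attr>a</attr><attr>b</attr><subtype>s</subtype></detailed></m>"
)
dest = ET.fromstring(
    "<m><detailed><enttyp/><attr>1</attr><attr>2</attr><attr>3</attr></detailed></m>"
)
replace_all_attrs(source, dest)
print(ET.tostring(dest, encoding="unicode"))
```

Because the destination's attr's are cleared first, it no longer matters whether the two files hold the same number of them, and copying the nodes leaves the source tree intact for reuse.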

Answers

0

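This sounds like a job for an XSL-T transformation. Have you tried that?

I'd also suggest a library like Beautiful Soup for parsing and manipulating XML.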


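On a program where I measured execution times, BeautifulSoup appeared to be 1000 times slower than a regex solution. I don't claim this holds in general, it was a single case, but the difference is significant all the same. – eyquem 2011-05-13 14:27:48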

1

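Here is an example of using lxml to do this. I'm not exactly sure how you want the <attr/> nodes replaced, but this example should provide a pattern you can reuse.

Update - I changed it to replace each <attr> in tree2 with the corresponding node from tree1, in document order: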

import copy 
import lxml.etree 

xml1 = '''<root><attr><chaos foo="0"/></attr><attr><arena foo="1"/></attr></root>''' 
xml2 = '''<tree><attr><one/></attr><attr><two/></attr></tree>''' 
tree1 = lxml.etree.fromstring(xml1) 
tree2 = lxml.etree.fromstring(xml2) 

# select <attr/> nodes from tree1, will be used to replace corresponding 
# nodes in tree2 
replacements = tree1.xpath('//attr') 
replacements.reverse() 

for attr in tree2.xpath('//attr'): 
    # replace the attr node in tree2 with 'replacement' from tree1 
    node = replacements.pop() 
    attr.getparent().replace(attr, copy.deepcopy(node)) 

print(lxml.etree.tostring(tree2).decode())

Result:

<tree> 
    <attr><chaos foo="0"/></attr> 
    <attr><arena foo="1"/></attr> 
</tree> 
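If the two documents may hold different numbers of <attr> nodes, one option is to pair them with zip(), which stops at the shorter list and leaves any extra destination nodes untouched. The following is only a sketch of that variation, written against the standard-library xml.etree.ElementTree instead of lxml (so a child-to-parent dictionary stands in for getparent()); the extra <three/> node is added to the sample to show the mismatch.

```python
import copy
import xml.etree.ElementTree as ET

xml1 = '<root><attr><chaos foo="0"/></attr><attr><arena foo="1"/></attr></root>'
# One more <attr> than xml1, to show what happens on a mismatch.
xml2 = '<tree><attr><one/></attr><attr><two/></attr><attr><three/></attr></tree>'
tree1 = ET.fromstring(xml1)
tree2 = ET.fromstring(xml2)

# ElementTree elements do not know their parent, so record it up front.
parent_of = {child: parent for parent in tree2.iter() for child in parent}

# Pair nodes in document order; zip() ends when either side is exhausted.
for old, new in zip(list(tree2.iter("attr")), tree1.iter("attr")):
    parent = parent_of[old]
    index = list(parent).index(old)
    parent[index] = copy.deepcopy(new)

print(ET.tostring(tree2, encoding="unicode"))
```

Whether leftover destination nodes should be kept, as here, or removed depends on what the target file is supposed to contain.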

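Thanks. I should have been clearer in the post. There are several "attr" elements in both the source and the destination xml, and they all contain different content. I want to remove all the "attr" elements in the destination and replace them all with the "attr" elements from the source. – dan 2011-05-13 17:59:12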


I can grab the orig. "attr" content and the source "attr" content, but I don't know how to do the replacement?? – dan 2011-05-13 18:02:21


How do you determine which destination attr gets replaced by which source attr? Are you replacing them in document order? – samplebias 2011-05-13 18:06:11