2014-01-10 75 views
1

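Merging XML files with Python's ElementTree while preserving CDATA tags

I reused almost exactly the code from "merging xml files using python's ElementTree" and got it working. The XML files I am trying to merge look like this: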

A.XML

<root> 
    <categories> 
    <category name="Biology" /> 
    </categories> 
    <app> 
    <mainHeader><![CDATA[AP Biology]]></mainHeader> 
    <questions> 
     <question type="0" number="1" title="Biology #1"> 
     <images /> 
     <description><![CDATA[<b>Which of the following is 
     the site of protein synthesis?</b>]]></description> 
     <category><![CDATA[Biology]]></category> 
     <choices> 
      <choice name="A"><![CDATA[Cell wall]]></choice> 
      <choice name="B" correct_answer="true"><![CDATA[Ribosomes]]></choice> 
      <choice name="C"><![CDATA[Vacuoles]]></choice> 
      <choice name="D"><![CDATA[DNA polymerase]]></choice> 
      <choice name="E"><![CDATA[RNA polymerase]]></choice> 
     </choices> 
     <explanation><![CDATA[<b>Answer:</b> B, Ribosomes. Translation, the 
     process that converts mRNA code into protein, takes place in ribosomes. 
     <br /><br /><b>Key Takeaway: </b>Ribosomes are complexes of RNA and 
     protein that are located in cell nuclei. Ribosomes catalyze both the 
     conversion of the mRNA code into amino acids as well as the assembly of 
     the individual amino acids into a peptide change that becomes a protein. 
     ]]></explanation> 
     </question> 
    </questions> 
    </app> 
</root> 

B.XML

<root> 
    <categories> 
    <category name="Biology" /> 
    </categories> 
    <app> 
    <mainHeader><![CDATA[SAT Biology]]></mainHeader> 
    <questions> 
     <question type="0" number="1" title="Biology #1"> 
     <images> 
     </images> 
     <category><![CDATA[Biology]]></category> 
     <description><![CDATA[<b>The site of cellular respiration 
     is:</b>]]></description> 
     <choices> 
      <choice name="A"><![CDATA[DNA polymerase]]></choice> 
      <choice name="B"><![CDATA[Ribosomes]]></choice> 
      <choice name="C" correct_answer="true"><![CDATA[Mitochondria]]></choice> 
      <choice name="D"><![CDATA[RNA polymerase]]></choice> 
      <choice name="E"><![CDATA[Vacuoles]]></choice> 
     </choices> 
     <explanation><![CDATA[<b>Answer:</b> C, Mitochondria. 
     The mitochondrion (plural mitochondria) is known as the “powerhouse” 
     of the cell for its role in energy production.<br /><br /> 
     <b>Key Takeaway: </b>The mitochondrion is a membrane-bound organelle 
     found in most eukaryotic cells. The dominant role of the mitochondrion 
     is the production of ATP through cellular respiration, which is 
     dependent on the presence of oxygen. All forms of cellular 
     respiration, glycolysis, Krebs’ cycle, and oxidative phosphorylation, 
     take place within the mitochondria.]]></explanation> 
     </question> 
    </questions> 
    </app> 
</root> 

This is the code I used to merge them:

import glob 
from xml.etree import ElementTree 

def run(files): 
    # Collect every XML file in the given directory.
    xml_files = glob.glob(files + "/*.xml") 
    xml_element_tree = None 
    insertion_point = None 
    for xml_file in xml_files: 
        data = ElementTree.parse(xml_file).getroot() 
        for questions in data.iter('questions'): 
            if xml_element_tree is None: 
                # The first file becomes the base document.
                xml_element_tree = data 
                insertion_point = xml_element_tree.find('app').find('questions') 
            else: 
                # Append the <question> children of every later file.
                insertion_point.extend(questions) 
    if xml_element_tree is not None: 
        # tostring() returns ASCII bytes by default; decode them for printing.
        print(ElementTree.tostring(xml_element_tree).decode('ascii')) 

It works, except that the output does not preserve the CDATA tags. Specifically, this is the result I get:

<root> 
    <categories> 
    <category name="Biology" /> 
    </categories> 
    <app> 
    <mainHeader>AP Biology</mainHeader> 
    <questions> 
     <question number="1" title="Biology #1" type="0"> 
     <images /> 
     <category>Biology</category> 
     <description>&lt;b&gt;Which of the following is the site 
     of protein synthesis?&lt;/b&gt;</description> 
     <choices> 
      <choice name="A">Cell wall</choice> 
      <choice correct_answer="true" name="B">Ribosomes</choice> 
      <choice name="C">Vacuoles</choice> 
      <choice name="D">DNA polymerase</choice> 
      <choice name="E">RNA polymerase</choice> 
     </choices> 
     <explanation>&lt;b&gt;Answer:&lt;/b&gt; B, Ribosomes. 
     Translation, the process that converts mRNA code into protein, 
     takes place in ribosomes.&lt;br /&gt;&lt;br /&gt;&lt;b&gt; 
     Key Takeaway: &lt;/b&gt;Ribosomes are complexes of RNA and protein 
     that are located in cell nuclei. Ribosomes catalyze both the 
     conversion of the mRNA code into amino acids as well as the assembly 
     of the individual amino acids into a peptide change that becomes 
     a protein.</explanation> 
     </question> 
     <question number="1" title="Biology #1" type="0"> 
     <images> 
     </images> 
     <category>Biology</category> 
     <description>&lt;b&gt;The site of cellular respiration is:&lt;/b&gt; 
     </description> 
     <choices> 
      <choice name="A">DNA polymerase</choice> 
      <choice name="B">Ribosomes</choice> 
      <choice correct_answer="true" name="C">Mitochondria</choice> 
      <choice name="D">RNA polymerase</choice> 
      <choice name="E">Vacuoles</choice> 
     </choices> 
     <explanation>&lt;b&gt;Answer:&lt;/b&gt; C, Mitochondria. The 
     mitochondrion (plural mitochondria) is known as the &#8220; 
     powerhouse&#8221; of the cell for its role in energy production. 
     &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Key Takeaway: &lt;/b&gt;The 
     mitochondrion is a membrane-bound organelle found in most 
     eukaryotic cells. The dominant role of the mitochondrion is the 
     production of ATP through cellular respiration, which is dependent 
     on the presence of oxygen. All forms of cellular respiration, 
     glycolysis, Krebs&#8217; cycle, and oxidative phosphorylation, 
     take place within the mitochondria.</explanation> 
     </question> 
    </questions> 
    </app> 
</root> 

whereas the output I want is this:

<root> 
    <categories> 
    <category name="Biology" /> 
    </categories> 
    <app> 
    <mainHeader><![CDATA[AP Biology]]></mainHeader> 
    <questions> 
     <question type="0" number="1" title="Biology #1"> 
     <images /> 
     <category><![CDATA[Biology]]></category> 
     <description><![CDATA[<b>Which of the following is the 
     site of protein synthesis?</b>]]></description> 
     <choices> 
      <choice name="A"><![CDATA[Cell wall]]></choice> 
      <choice name="B" correct_answer="true"><![CDATA[Ribosomes]]></choice> 
      <choice name="C"><![CDATA[Vacuoles]]></choice> 
      <choice name="D"><![CDATA[DNA polymerase]]></choice> 
      <choice name="E"><![CDATA[RNA polymerase]]></choice> 
     </choices> 
     <explanation><![CDATA[<b>Answer:</b> B, Ribosomes. Translation, 
     the process that converts mRNA code into protein, takes place in 
     ribosomes.<br /><br /><b>Key Takeaway: </b>Ribosomes are complexes 
     of RNA and protein that are located in cell nuclei. Ribosomes 
     catalyze both the conversion of the mRNA code into amino acids as 
     well as the assembly of the individual amino acids into a peptide 
     change that becomes a protein.]]></explanation> 
     </question> 
     <question type="0" number="2" title="Biology #1"> 
     <images /> 
     <category><![CDATA[Biology]]></category> 
     <description><![CDATA[<b>The site of cellular respiration 
     is:</b>]]></description> 
     <choices> 
      <choice name="A"><![CDATA[DNA polymerase]]></choice> 
      <choice name="B"><![CDATA[Ribosomes]]></choice> 
      <choice name="C" correct_answer="true"><![CDATA[Mitochondria]]></choice> 
      <choice name="D"><![CDATA[RNA polymerase]]></choice> 
      <choice name="E"><![CDATA[Vacuoles]]></choice> 
     </choices> 
     <explanation><![CDATA[<b>Answer:</b> C, Mitochondria. The 
     mitochondrion (plural mitochondria) is known as the “powerhouse” 
     of the cell for its role in energy production.<br /><br /> 
     <b>Key Takeaway: </b>The mitochondrion is a membrane-bound 
     organelle found in most eukaryotic cells. The dominant role 
     of the mitochondrion is the production of ATP through cellular 
     respiration, which is dependent on the presence of oxygen. 
     All forms of cellular respiration, glycolysis, Krebs’ cycle, 
     and oxidative phosphorylation, take place within the 
     mitochondria.]]></explanation> 
     </question> 
    </questions> 
    </app> 
</root> 

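How can I preserve the CDATA tags in my merged output? How do I keep <b>, <br> and the curly quotes in the merged output, instead of getting weird things like &lt;b&gt;? Sorry if this is a really basic question, but I would really appreciate the help.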

Answers

0

Use Python's HTMLParser library to unescape the output, but note that this doesn't create those CDATA sections:

text = """ 
<root> 
    <categories> 
    <category name="Biology" /> 
    </categories> 
    <app> 
    <mainHeader>AP Biology</mainHeader> 
    <questions> 
     <question number="1" title="Biology #1" type="0"> 
     <images /> 
     <category>Biology</category> 
     <description>&lt;b&gt;Which of the following is the site 
     of protein synthesis?&lt;/b&gt;</description> 
     <choices> 
      <choice name="A">Cell wall</choice> 
      <choice correct_answer="true" name="B">Ribosomes</choice> 
      <choice name="C">Vacuoles</choice> 
      <choice name="D">DNA polymerase</choice> 
      <choice name="E">RNA polymerase</choice> 
     </choices> 
     <explanation>&lt;b&gt;Answer:&lt;/b&gt; B, Ribosomes. 
     Translation, the process that converts mRNA code into protein, 
     takes place in ribosomes.&lt;br /&gt;&lt;br /&gt;&lt;b&gt; 
     Key Takeaway: &lt;/b&gt;Ribosomes are complexes of RNA and protein 
     that are located in cell nuclei. Ribosomes catalyze both the 
     conversion of the mRNA code into amino acids as well as the assembly 
     of the individual amino acids into a peptide change that becomes 
     a protein.</explanation> 
     </question> 
     <question number="1" title="Biology #1" type="0"> 
     <images> 
     </images> 
     <category>Biology</category> 
     <description>&lt;b&gt;The site of cellular respiration is:&lt;/b&gt; 
     </description> 
     <choices> 
      <choice name="A">DNA polymerase</choice> 
      <choice name="B">Ribosomes</choice> 
      <choice correct_answer="true" name="C">Mitochondria</choice> 
      <choice name="D">RNA polymerase</choice> 
      <choice name="E">Vacuoles</choice> 
     </choices> 
     <explanation>&lt;b&gt;Answer:&lt;/b&gt; C, Mitochondria. The 
     mitochondrion (plural mitochondria) is known as the &#8220; 
     powerhouse&#8221; of the cell for its role in energy production. 
     &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Key Takeaway: &lt;/b&gt;The 
     mitochondrion is a membrane-bound organelle found in most 
     eukaryotic cells. The dominant role of the mitochondrion is the 
     production of ATP through cellular respiration, which is dependent 
     on the presence of oxygen. All forms of cellular respiration, 
     glycolysis, Krebs&#8217; cycle, and oxidative phosphorylation, 
     take place within the mitochondria.</explanation> 
     </question> 
    </questions> 
    </app> 
</root> 
""" 

# Python 3; on Python 2 the equivalent was HTMLParser.HTMLParser().unescape(text)
import html 

unescaped = html.unescape(text) 

print(unescaped) 

Output:

<root> 
    <categories> 
    <category name="Biology" /> 
    </categories> 
    <app> 
    <mainHeader>AP Biology</mainHeader> 
    <questions> 
     <question number="1" title="Biology #1" type="0"> 
     <images /> 
     <category>Biology</category> 
     <description><b>Which of the following is the site 
     of protein synthesis?</b></description> 
     <choices> 
      <choice name="A">Cell wall</choice> 
      <choice correct_answer="true" name="B">Ribosomes</choice> 
      <choice name="C">Vacuoles</choice> 
      <choice name="D">DNA polymerase</choice> 
      <choice name="E">RNA polymerase</choice> 
     </choices> 
     <explanation><b>Answer:</b> B, Ribosomes. 
     Translation, the process that converts mRNA code into protein, 
     takes place in ribosomes.<br /><br /><b> 
     Key Takeaway: </b>Ribosomes are complexes of RNA and protein 
     that are located in cell nuclei. Ribosomes catalyze both the 
     conversion of the mRNA code into amino acids as well as the assembly 
     of the individual amino acids into a peptide change that becomes 
     a protein.</explanation> 
     </question> 
     <question number="1" title="Biology #1" type="0"> 
     <images> 
     </images> 
     <category>Biology</category> 
     <description><b>The site of cellular respiration is:</b> 
     </description> 
     <choices> 
      <choice name="A">DNA polymerase</choice> 
      <choice name="B">Ribosomes</choice> 
      <choice correct_answer="true" name="C">Mitochondria</choice> 
      <choice name="D">RNA polymerase</choice> 
      <choice name="E">Vacuoles</choice> 
     </choices> 
     <explanation><b>Answer:</b> C, Mitochondria. The 
     mitochondrion (plural mitochondria) is known as the “ 
     powerhouse” of the cell for its role in energy production. 
     <br /><br /><b>Key Takeaway: </b>The 
     mitochondrion is a membrane-bound organelle found in most 
     eukaryotic cells. The dominant role of the mitochondrion is the 
     production of ATP through cellular respiration, which is dependent 
     on the presence of oxygen. All forms of cellular respiration, 
     glycolysis, Krebs’ cycle, and oxidative phosphorylation, 
     take place within the mitochondria.</explanation> 
     </question> 
    </questions> 
    </app> 
</root> 
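
One caveat with unescaping the whole serialized document, shown as a minimal sketch on a single element taken from the output above: once the entities are turned back into angle brackets, an XML parser no longer sees the markup as text but as real child elements. The variable names here are only illustrative.

```python
import html
from xml.etree import ElementTree

escaped = ("<description>&lt;b&gt;The site of cellular respiration "
           "is:&lt;/b&gt;</description>")

# Escaped form: the markup is plain text inside <description>.
before = ElementTree.fromstring(escaped)
print(before.text)   # <b>The site of cellular respiration is:</b>
print(len(before))   # 0 (no child elements)

# Unescaped form: <b> is now a real child element of <description>.
after = ElementTree.fromstring(html.unescape(escaped))
print(after.text)    # None
print(after[0].tag)  # b
```

So the unescaped output looks right on screen, but it is a different document structure from the CDATA version the question asks for.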
1

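CDATA is meant for data that the XML parser should not interpret as markup. I think the best you can do in this case, then, is to capture the text, like this: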

>>> import xml.etree.ElementTree as et 
>>> element = et.fromstring('''<explanation><![CDATA[<b>Answer:</b> B, Ribosomes. Translation, 
     the process that converts mRNA code into protein, takes place in 
     ribosomes.<br /><br /><b>Key Takeaway: </b>Ribosomes are complexes 
     of RNA and protein that are located in cell nuclei. Ribosomes 
     catalyze both the conversion of the mRNA code into amino acids as 
     well as the assembly of the individual amino acids into a peptide 
     change that becomes a protein.]]></explanation>''') 
>>> element.text 
'<b>Answer:</b> B, Ribosomes. Translation, \n  the process that converts mRNA code into protein, takes place in \n  ribosomes.<br /><br /><b>Key Takeaway: </b>Ribosomes are complexes \n  of RNA and protein that are located in cell nuclei. Ribosomes \n  catalyze both the conversion of the mRNA code into amino acids as \n  well as the assembly of the individual amino acids into a peptide \n  change that becomes a protein.' 

Then you can unescape the text as @praveen suggested.
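
If the CDATA sections themselves have to survive the merge, one option that stays within the standard library is to swap ElementTree for xml.dom.minidom, which keeps CDATA sections as separate nodes and writes them back out unchanged. This is a different technique from the ElementTree code in the question, not a fix to it. The following is only a sketch: the function name merge_questions and the two shortened sample documents are made up for illustration, and it assumes each document has exactly one <questions> element.

```python
from xml.dom import minidom

def merge_questions(xml_strings):
    """Append the <question> elements of later documents to the first one."""
    merged = None
    target = None
    for xml_string in xml_strings:
        doc = minidom.parseString(xml_string)
        questions = doc.getElementsByTagName("questions")[0]
        if merged is None:
            # The first document becomes the base of the merged result.
            merged, target = doc, questions
            continue
        for question in questions.getElementsByTagName("question"):
            # importNode makes a deep copy owned by the merged document;
            # CDATA section nodes are copied as CDATA sections.
            target.appendChild(merged.importNode(question, True))
    return merged

# Shortened, hypothetical stand-ins for A.xml and B.xml.
A_XML = """<root><app><questions>
<question number="1"><description><![CDATA[<b>Which of the following is the site of protein synthesis?</b>]]></description></question>
</questions></app></root>"""
B_XML = """<root><app><questions>
<question number="1"><description><![CDATA[<b>The site of cellular respiration is:</b>]]></description></question>
</questions></app></root>"""

result = merge_questions([A_XML, B_XML]).documentElement.toxml()
print(result)
```

To merge real files, read each file's contents and pass them in the same way. Renumbering the merged questions, as in the desired output, would still have to be done separately.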