2017-04-21 82 views

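Problem extracting values associated with XML tags using Python

I have Python code that parses an XML file and extracts all of its tags. Now I want to extract specific values associated with those tags, but I have run into some problems doing so. A sample of my XML file looks like this: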

<Cell ss:StyleID="s65"><Data ss:Type="String">Variable Name</Data></Cell> 
    <Cell ss:StyleID="s65"><Data ss:Type="String">Variable Label</Data></Cell> 
    <Cell ss:StyleID="s79"><Data ss:Type="String">Minimum&#10;Value</Data></Cell> 
    <Cell ss:StyleID="s79"><Data ss:Type="String">Maximum&#10;Value</Data></Cell> 
    <Cell ss:StyleID="s80"><Data ss:Type="String">Mean&#10;Value</Data></Cell> 

    <Row ss:AutoFitHeight="0" ss:Height="15"> 
    <Cell ss:StyleID="s73"><Data ss:Type="String">Marks</Data></Cell> 
    <Cell ss:StyleID="s73"><Data ss:Type="String">Marks of Students</Data></Cell> 
    <Cell ss:StyleID="s82"><Data ss:Type="Number">0</Data></Cell> 
    <Cell ss:StyleID="s82"><Data ss:Type="Number">96</Data></Cell> 
    <Cell ss:StyleID="s83"><Data ss:Type="Number">65.71</Data></Cell> 
    </Row> 

The above is only a part of the complete XML file that I want to extract from. I wrote this code to print all the tags in the XML file:

import xml.etree.ElementTree

xmlTree = xml.etree.ElementTree.parse('sample_xml.xml').getroot()

# Collect the tag of every element in document order
elemList = []

for elem in xmlTree.iter():
    elemList.append(elem.tag)

# Print the collected tags
for element in elemList:
    print(element)

Now, when I run this code, I see a long, repetitive run of output like the sample below:

{urn:schemas-microsoft-com:office:spreadsheet}Interior 
{urn:schemas-microsoft-com:office:spreadsheet}NumberFormat 
{urn:schemas-microsoft-com:office:spreadsheet}Protection 
{urn:schemas-microsoft-com:office:spreadsheet}Worksheet 
{urn:schemas-microsoft-com:office:spreadsheet}Table 
{urn:schemas-microsoft-com:office:spreadsheet}Column 
{urn:schemas-microsoft-com:office:spreadsheet}Column 
{urn:schemas-microsoft-com:office:spreadsheet}Column 
{urn:schemas-microsoft-com:office:spreadsheet}Column 
{urn:schemas-microsoft-com:office:spreadsheet}Column 
{urn:schemas-microsoft-com:office:spreadsheet}Row 
{urn:schemas-microsoft-com:office:spreadsheet}Cell 
{urn:schemas-microsoft-com:office:spreadsheet}Data 
{urn:schemas-microsoft-com:office:spreadsheet}Row 
{urn:schemas-microsoft-com:office:spreadsheet}Cell 
{urn:schemas-microsoft-com:office:spreadsheet}Data 
{urn:schemas-microsoft-com:office:spreadsheet}Row 
{urn:schemas-microsoft-com:office:spreadsheet}Cell 
{urn:schemas-microsoft-com:office:spreadsheet}Data 
{urn:schemas-microsoft-com:office:spreadsheet}Row 
{urn:schemas-microsoft-com:office:spreadsheet}Cell 
{urn:schemas-microsoft-com:office:spreadsheet}Data 
{urn:schemas-microsoft-com:office:spreadsheet}Row 
{urn:schemas-microsoft-com:office:spreadsheet}Cell 
{urn:schemas-microsoft-com:office:spreadsheet}Data 
{urn:schemas-microsoft-com:office:spreadsheet}Row 
{urn:schemas-microsoft-com:office:spreadsheet}Cell 
{urn:schemas-microsoft-com:office:spreadsheet}Data 

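I don't know which Cell, Data, and Row elements to target in order to extract the values I need (Marks, Marks of Students, minimum value, maximum value), as shown in the sample XML at the beginning. How can I do this?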

UPDATE: Following the suggestion, I was able to extract the relevant text using the code below:

for elem in xmlTree.iter():
    if elem.text is not None:
        print(elem.text)

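Now the problem is that my XML file contains a lot of different text, but I want to extract only the 4 texts that come after these 4 tag texts: Marks, Marks of Students, Minimum Marks, Maximum Marks. I tried to use next(), expecting the iterator to move to the next tag when the current tag matches Marks and to keep matching the next 3 tags in that order, but it does not produce the desired result. Here is what I wrote: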

for elem in xmlTree.iter():
    if elem.text == 'Marks':
        if next(xmlTree.iter()) == 'Marks of Students':
            if next(xmlTree.iter()) == 'Minimum Value':
                if next(xmlTree.iter()) == 'Maximum Value':
                    print(next(elem.text))
                    print(next(elem.text))
                    print(next(elem.text))
                    print(next(elem.text))

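I can't reproduce the problem using your XML, modified to make it well-formed. Please post a *minimal but complete* sample XML, along with the corresponding output, that demonstrates the problem... – har07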

Answer


I can't reproduce the problem with the XML you have given here, but I suspect your XML file may be in the following format.

<?xml version="1.0"?> 
<?mso-application progid="Excel.Sheet"?> 
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" 
xmlns:o="urn:schemas-microsoft-com:office:office" 
xmlns:x="urn:schemas-microsoft-com:office:excel" 
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" 
xmlns:html="http://www.w3.org/TR/REC-html40"> 
<Interior/> 
<NumberFormat/> 
<Protection/> 
<Worksheet ss:Name="Sheet1"> 
<Table ss:ExpandedColumnCount="6" ss:ExpandedRowCount="2685" x:FullColumns="1" 
x:FullRows="1"> 
<Column ss:AutoFitWidth="0" ss:Width="26.25"/> 
<Column ss:AutoFitWidth="0" ss:Width="117" ss:Span="3"/> 
<Column ss:Index="6" ss:AutoFitWidth="0" ss:Width="29.25"/> 
<Row ss:AutoFitHeight="0" ss:Height="60"> 
<Cell ss:StyleID="s22"/> 
<Cell ss:StyleID="s23"><Data ss:Type="String">Name</Data></Cell> 
<Cell ss:StyleID="s23"><Data ss:Type="String">UserName</Data></Cell> 
<Cell ss:StyleID="s23"><Data ss:Type="String">Address</Data></Cell> 
<Cell ss:StyleID="s23"><Data ss:Type="String">Telephone Number</Data></Cell> 
<Cell ss:StyleID="s22"/> 
</Row> 
<Row ss:AutoFitHeight="0" ss:Height="30"> 
<Cell ss:StyleID="s22"/> 
<Cell ss:StyleID="s24"><Data ss:Type="String">John Smith</Data></Cell> 
<Cell ss:StyleID="s24"><Data ss:Type="String">JSmith</Data></Cell> 
<Cell ss:StyleID="s24"><Data ss:Type="String">ABC</Data></Cell> 
<Cell ss:StyleID="s24"><Data ss:Type="String">(999) 999-9999</Data></Cell> 
<Cell ss:StyleID="s22"/> 
</Row> 
</Table> 
</Worksheet> 
</Workbook> 

If it is the same format, then you can use the code below.

import xml.etree.ElementTree as etree

# iterparse yields (event, element) pairs, so item[1] is the element
with open('sample.xml', 'rb') as xml_file:
    for item in etree.iterparse(xml_file):
        if item[1].text is not None:
            print(item[1].text)

I used the following reference to understand and reproduce the code: Reading Excel xml to dictionary
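If the file really is a SpreadsheetML workbook like the one above, it may be easier to walk it row by row than to print every text node. The following is a minimal sketch under that assumption: `SAMPLE` is a hypothetical, trimmed-down workbook, `NS` is a prefix map chosen for this example, and the first row is assumed to hold the column headers.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample modeled on the workbook shown above.
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Row>
    <Cell ss:StyleID="s23"><Data ss:Type="String">Name</Data></Cell>
    <Cell ss:StyleID="s23"><Data ss:Type="String">UserName</Data></Cell>
   </Row>
   <Row>
    <Cell ss:StyleID="s24"><Data ss:Type="String">John Smith</Data></Cell>
    <Cell ss:StyleID="s24"><Data ss:Type="String">JSmith</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>"""

# Prefix map for findall(); the prefix name "ss" is our own choice here.
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

root = ET.fromstring(SAMPLE)

# One list of cell texts per Row element, in document order.
rows = [
    [data.text for data in row.findall("ss:Cell/ss:Data", NS)]
    for row in root.findall(".//ss:Row", NS)
]

# Assumption: the first row is the header, the rest are data rows.
header, *body = rows
records = [dict(zip(header, values)) for values in body]

for record in records:
    print(record)
```

Note that cells without a Data child (such as the empty `s22` cells above) are skipped by this sketch, so columns would shift if empty cells appear between filled ones.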


When I run `for elem in xmlTree.iter(): if elem[1].text != None: print(elem[1].text)` I get `IndexError: child index out of range` – user2966197
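A likely cause of that error: `iterparse` yields `(event, element)` tuples, so `item[1]` is the element itself, whereas `iter()` yields elements directly, so `elem[1]` asks for the element's second child. The sketch below illustrates the difference on a hypothetical one-cell fragment (`XML` is made up for this example).

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical fragment: a Row with a single Cell and a single Data child.
XML = b"<Row><Cell><Data>Marks</Data></Cell></Row>"

# iterparse yields (event, element) tuples; unpack them instead of indexing.
texts_iterparse = [
    elem.text
    for _event, elem in ET.iterparse(io.BytesIO(XML))
    if elem.text is not None
]

# iter() yields elements directly, so use elem.text without any index.
root = ET.fromstring(XML)
texts_iter = [elem.text for elem in root.iter() if elem.text is not None]

# Indexing an element selects a child; Row has only one, so [1] fails.
try:
    root[1]
    index_failed = False
except IndexError:
    index_failed = True

print(texts_iterparse, texts_iter, index_failed)
```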


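I was able to resolve the above error, but I have a question. My XML file contains a bunch of different tag texts. What I want to do is check whether a tag's text is "Marks", and then check the next 3 tags to see whether they are "Marks of Students, Minimum Marks, Maximum Marks". If they are, extract the next 4 tag values; otherwise continue. How can I do this? – user2966197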
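In the attempt from the question, each `next(xmlTree.iter())` call creates a fresh iterator that starts again at the root and returns an element, which never compares equal to a string, and `next(elem.text)` is called on a string, which is not an iterator. One possible approach is to collect the Data texts into a list first and then look for the label sequence by position. The sketch below assumes the four labels are immediately followed by the four values; `SAMPLE`, `LABELS`, `normalize`, and `values_after` are hypothetical names and data introduced for illustration.

```python
import xml.etree.ElementTree as ET

NS_URI = "urn:schemas-microsoft-com:office:spreadsheet"

# Hypothetical sample: four label cells followed by four value cells.
SAMPLE = f"""<Table xmlns="{NS_URI}" xmlns:ss="{NS_URI}">
 <Row>
  <Cell><Data ss:Type="String">Marks</Data></Cell>
  <Cell><Data ss:Type="String">Marks of Students</Data></Cell>
  <Cell><Data ss:Type="String">Minimum&#10;Value</Data></Cell>
  <Cell><Data ss:Type="String">Maximum&#10;Value</Data></Cell>
 </Row>
 <Row>
  <Cell><Data ss:Type="String">Math</Data></Cell>
  <Cell><Data ss:Type="String">Math marks</Data></Cell>
  <Cell><Data ss:Type="Number">0</Data></Cell>
  <Cell><Data ss:Type="Number">96</Data></Cell>
 </Row>
</Table>"""

# Assumed label sequence; adjust to the labels actually present in the file.
LABELS = ["Marks", "Marks of Students", "Minimum Value", "Maximum Value"]


def normalize(text):
    """Collapse whitespace, so 'Minimum&#10;Value' matches 'Minimum Value'."""
    return " ".join(text.split())


def values_after(texts, labels, count):
    """Return up to `count` texts following the first run equal to `labels`."""
    n = len(labels)
    for i in range(len(texts) - n + 1):
        if texts[i:i + n] == labels:
            return texts[i + n:i + n + count]
    return None  # label sequence not found


root = ET.fromstring(SAMPLE)
texts = [
    normalize(data.text)
    for data in root.iter("{" + NS_URI + "}Data")
    if data.text
]
result = values_after(texts, LABELS, 4)
print(result)
```

Whitespace is normalized because the sample cells contain `&#10;` (a line feed) inside the label, so a direct comparison with `'Minimum Value'` would not match.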


I have updated my post to reflect the current problem – user2966197