2016-04-28 125 views
1

Iterparsing a huge XML file with Python, but getting an error

I am trying to parse a huge XML file with Python, but I get this error:

    File "parser.py", line 6, in <module>
      event, root = text.next()
    File "C:\Python27\lib\xml\etree\ElementTree.py", line 1281, in next
      self._root = self._parser.close()
    File "C:\Python27\lib\xml\etree\ElementTree.py", line 1654, in close
      self._raiseerror(v)
    File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
      raise err
    xml.etree.ElementTree.ParseError: syntax error: line 1, column 0

My code currently looks like this:

    import xml.etree.ElementTree as ET
    from StringIO import StringIO

    text = ET.iterparse(StringIO('Posts.xml'), events=('start', 'end', 'start-ns', 'end-ns'))
    text = iter(text)
    event, root = text.next()

    for event, elem in text:
        currId = elem.get('PostTypeId')
        if (currId != '1'):
            root.remove(elem)

    tree.write('cut.xml')

The XML file I am trying to parse looks like this:

<posts> 

    <row FavoriteCount="4" CommentCount="4" AnswerCount="7" Tags="<discussion><answers>" Title="Why would anyone accept an answer?" LastActivityDate="2014-04-23T09:14:37.103" LastEditDate="2010-09-03T00:42:07.733" LastEditorUserId="99" OwnerUserId="4" Body="<p>I'm looking at the questions proposed during the Area 51 process:</p> <ul> <li>My supervisor thinks that all <code>If</code> statements should include <code>else</code> statements. Do you agree?</li> <li>What are common mistakes in Software Development?</li> <li>Tabs vs. Spaces: What is the one proper indentation character for everything, in every situation, ever?</li> <li>What programming language should I teach to my 4 year old son?</li> <li>What was the turning point of your programming career?</li> </ul> <p>None of these have an answer that should be accepted. The questions are interesting, and the answers would also be informative if the answer was well written and explained why the answerer thinks his method or idea is better. But I can't really see being able to accept an answer to any of these questions.</p> <p>So, if I ask a question, how do I decide if or how to accept an answer? There is no right or wrong answer and just because it works for me doesn't mean I should be floating that answer to the top - unless I'm overlooking something, the questions that are on topic here are very subjective. On Stack Overflow, there are often multiple right solutions to a problem. Here, we have a problem with an infinite number of solutions, none of which are arguably better or worse than any others.</p> <p>Thoughts?</p> " ViewCount="1582" Score="30" CreationDate="2010-09-01T19:32:45.710" PostTypeId="1" Id="1"/> 

    <row CommentCount="0" AnswerCount="4" Tags="<discussion><site-attributes><faq-contents><top-7>" Title="What should our FAQ contain?" LastActivityDate="2015-03-18T19:19:24.887" LastEditDate="2015-03-18T19:19:24.887" LastEditorUserId="25936" OwnerUserId="9" Body="<p>One of the big 7 questions.</p> " ViewCount="318" Score="6" CreationDate="2010-09-01T19:34:51.797" PostTypeId="1" Id="2" CommunityOwnedDate="2010-09-02T03:42:26.083"/> 

    <row FavoriteCount="8" CommentCount="8" AnswerCount="32" Tags="<discussion><top-7><site-attributes>" Title="What should our domain name be?" LastActivityDate="2014-04-23T09:14:37.103" LastEditDate="2010-12-20T02:46:31.950" LastEditorUserId="2314" OwnerUserId="9" Body="<blockquote> <p><strong>Possible Duplicate:</strong><br> <a href="http://meta.programmers.stackexchange.com/questions/412/write-an-elevator-pitch-tagline">Write an Elevator Pitch/Tagline</a> </p> </blockquote> <h2>Note:</h2> <p>We are closing this domain naming thread. It is asking the <em>entirely</em> wrong question. See this blog post for details: <a href="http://blog.stackoverflow.com/2010/10/domain-names-the-wrong-question/" rel="nofollow">Domain Names: Wrong Question</a> </p> <p>We're going to keep the name programmers.stackexchange.com. But we WILL be setting up redirects from the more "popular" domains names. (e.g. seasonedadvice.com to cooking.stackexchange.com, basicallymoney.com to money.stackexchange.com, and others as we go through the list).</p> <p>New question: "<strong>Write an Elevator Pitch/Tagline!</strong>"</p> <p><a href="http://meta.programmers.stackexchange.com/questions/412/write-an-elevator-pitch-tagline"><strong>Click here to contribute ideas and vote.</strong></a> </p> <p><em>[original message text below]</em></p> <p>One of the big 7 questions.</p> <ul> <li>One answer per answer please</li> <li>Only .com domain names please</li> <li>Only untaken domain names please (use whois)</li> </ul> <p>Please use <strong>lowercase characters only</strong> in domain name!<br> DomainName.com is more readable, but we have to register domainname.com!</p> " ViewCount="1146" Score="16" CreationDate="2010-09-01T19:36:08.390" PostTypeId="1" Id="3" CommunityOwnedDate="2010-09-02T03:40:00.467" ClosedDate="2010-10-08T21:02:50.313"/> 
    ... 

    </posts> 
+1

What is the '-' at the beginning of the XML? – alecxe

+0

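I'm not sure; it was part of the XML file when I downloaded it from Stack Overflow (it's one of their data sets). When I parse the whole file with etree.parse, it works fine. The error only occurs when I use iterparse. – Felixasdf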

+0

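The error says the file input is invalid, specifically the first byte/character of the first line. Maybe some parsers are more tolerant of invalid input (in violation of the spec)? Or maybe it is a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) that did not copy well when you posted it (and is not handled by iterparse). – dsh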

Answers

1

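ElementTree.iterparse needs some kind of source: a filename or a file object. You are giving it a string buffer whose content is the text Posts.xml, not the actual contents of the file Posts.xml, and that text is clearly not valid XML syntax.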

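So just get rid of the StringIO call, and ElementTree will handle opening the file for you. However, your input file has some problems that will prevent it from being parsed correctly (see sverasch's answer).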
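A minimal sketch of that fix, assuming the `row` elements are direct children of the root as in the sample above. The helper name `keep_questions` and the three-row sample data are illustrative, not from the original post; `next(context)` is used in place of `text.next()` so the sketch runs on both Python 2.7 and Python 3.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def keep_questions(src_path, dst_path):
    """Copy src_path to dst_path, keeping only rows with PostTypeId == '1'."""
    # Pass the path itself: iterparse opens and reads the file for us.
    context = iter(ET.iterparse(src_path, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        # Only act on fully parsed <row> elements.
        if event == "end" and elem.tag == "row" and elem.get("PostTypeId") != "1":
            root.remove(elem)
    ET.ElementTree(root).write(dst_path)


# Hypothetical sample data standing in for the real Posts.xml.
SAMPLE = (
    '<posts>'
    '<row Id="1" PostTypeId="1"/>'
    '<row Id="2" PostTypeId="2"/>'
    '<row Id="3" PostTypeId="1"/>'
    '</posts>'
)

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "Posts.xml")
dst = os.path.join(workdir, "cut.xml")
with open(src, "w") as handle:
    handle.write(SAMPLE)

keep_questions(src, dst)
kept_ids = [row.get("Id") for row in ET.parse(dst).getroot()]
print(kept_ids)
```

Note that every kept row stays attached to the root, so memory use still grows with the number of matching rows; for a very large dump it may be preferable to write each kept row out and clear it instead.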

1

I ran your sample XML through xmllint (http://linux.die.net/man/1/xmllint) and found that the less-than and greater-than signs in it have been converted to literal characters.

> < 

should be

&gt; &lt; 

When the file is parsed, the parser thinks it has reached a new tag or a premature end tag.
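A small sketch of the difference, using a cut-down `row` element modelled on the sample above (the variable names are illustrative). It relies on `quoteattr` from the standard library's `xml.sax.saxutils` to produce the escaped form; this is one way to escape an attribute value, not necessarily how the data dump was generated.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

tags = "<discussion><answers>"

# Literal angle brackets inside the attribute value, as in the pasted sample.
bad = '<row Tags="%s"/>' % tags
# quoteattr escapes the value and wraps it in quotes.
good = "<row Tags=%s/>" % quoteattr(tags)

try:
    ET.fromstring(bad)
    bad_parses = True
except ET.ParseError:
    # A raw '<' is not allowed inside an attribute value.
    bad_parses = False

row = ET.fromstring(good)
print(good)
print(bad_parses, row.get("Tags") == tags)
```

After parsing, the escaped form yields the original attribute text again, so nothing is lost by escaping.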

1

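You are not reading the file correctly.

StringIO('Posts.xml') does not read the file; it creates a file-like object whose content is the string 'Posts.xml'.

That is why iterparse complains: the content does not start with <.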
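This can be checked directly. The sketch below uses `io.StringIO` rather than the Python 2 `StringIO` module so that it runs on both Python 2.7 and Python 3; that substitution is an assumption made for portability, and the variable names are illustrative.

```python
import io
import xml.etree.ElementTree as ET

buffer = io.StringIO(u"Posts.xml")
content = buffer.read()  # the literal text, not the contents of the file
buffer.seek(0)

try:
    # Consume all events; parsing the nine characters "Posts.xml" fails.
    list(ET.iterparse(buffer))
    error = None
except ET.ParseError as exc:
    error = exc

print(repr(content))
print(error)
```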