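How to parse a huge XML file (on the go) using Python

I have a huge XML file (currently a Wikipedia dump). This XML is about 45 GB in size and represents the entire data of the current Wikipedia. The first few lines of the file (output of more):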

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">
     <siteinfo> 
     <sitename>Wikipedia</sitename> 
     <base>http://en.wikipedia.org/wiki/Main_Page</base> 
     <generator>MediaWiki 1.21wmf6</generator> 
     <case>first-letter</case> 
     <namespaces> 
      <namespace key="-2" case="first-letter">Media</namespace> 
      <namespace key="-1" case="first-letter">Special</namespace> 
      <namespace key="0" case="first-letter" /> 
      <namespace key="1" case="first-letter">Talk</namespace> 
      <namespace key="2" case="first-letter">User</namespace> 
      <namespace key="3" case="first-letter">User talk</namespace> 
      <namespace key="4" case="first-letter">Wikipedia</namespace> 
      <namespace key="5" case="first-letter">Wikipedia talk</namespace> 
      <namespace key="6" case="first-letter">File</namespace> 
      <namespace key="7" case="first-letter">File talk</namespace> 
      <namespace key="8" case="first-letter">MediaWiki</namespace> 
      <namespace key="9" case="first-letter">MediaWiki talk</namespace> 
      <namespace key="10" case="first-letter">Template</namespace> 
      <namespace key="11" case="first-letter">Template talk</namespace> 
      <namespace key="12" case="first-letter">Help</namespace> 
      <namespace key="13" case="first-letter">Help talk</namespace> 
      <namespace key="14" case="first-letter">Category</namespace> 
      <namespace key="15" case="first-letter">Category talk</namespace> 
      <namespace key="100" case="first-letter">Portal</namespace> 
      <namespace key="101" case="first-letter">Portal talk</namespace> 
      <namespace key="108" case="first-letter">Book</namespace> 
      <namespace key="109" case="first-letter">Book talk</namespace> 
      <namespace key="446" case="first-letter">Education Program</namespace> 
      <namespace key="447" case="first-letter">Education Program talk</namespace>
      <namespace key="710" case="first-letter">TimedText</namespace> 
      <namespace key="711" case="first-letter">TimedText talk</namespace> 
     </namespaces> 
     </siteinfo> 
     <page> 
     <title>AccessibleComputing</title> 
     <ns>0</ns> 
     <id>10</id> 
     <redirect title="Computer accessibility" /> 
     <revision> 
      <id>381202555</id> 
      <parentid>381200179</parentid> 
      <timestamp>2010-08-26T22:38:36Z</timestamp> 
      <contributor> 
      <username>OlEnglish</username> 
      <id>7181920</id> 
      </contributor> 
      <minor /> 
      <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
      <sha1>lo15ponaybcg2sf49sstw9gdjmdetnk</sha1> 
      <model>wikitext</model>

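... and so on

Note the page element in the tree. It corresponds to a single page in Wikipedia. The given XML consists of all Wikipedia pages in the form of page elements. I need to write a parser that extracts the value of the title entry from every page and, for simplicity, prints it.

I am trying to build this in Python (though I can switch languages if a solution is provided). The only way I know is to use ElementTree.

However, the function parse('file.xml') has to parse the whole document before it can output any result. Obviously, I know that the whole XML consists of page elements. I want the program to start printing titles while the rest of the XML is still being parsed. Is this even possible? If so, how?

Edit: I used title extraction as the example here to keep the question simple. However, I do need real XML parsing capability, because I will need it for further extraction in the future.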

Source

2013-04-08 Saurabh Agarwal

Related: http://stackoverflow.com/questions/3707155/can-python-xml-elementtree-parse-a-very-large-xml-file – 2013-04-08 23:57:33

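What you want is an event-based XML library, one that hands you pieces as it parses incrementally instead of building a tree for the whole document. The canonical answer is the xml.sax stdlib module, but I'm sure there are plenty of others.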
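To make the event-based idea concrete, here is a minimal sketch using xml.sax. The names TitleHandler and stream_titles are made up for illustration, and the sketch assumes titles appear as <title> elements, as in the excerpt from the question. It is one possible shape for such a handler, not the only way to do it.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler: reports the text of each <title> element."""

    def __init__(self, on_title):
        super().__init__()
        self._on_title = on_title  # callback invoked once per title
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # The parser may deliver one text node in several pieces.
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
            self._on_title("".join(self._chunks))


def stream_titles(source, on_title=print):
    """Parse source (a filename or binary file object) and report titles."""
    xml.sax.parse(source, TitleHandler(on_title))
```

Calling stream_titles('file.xml') should print each title as soon as the parser reaches its closing tag, without building a tree for the document.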

Source

2013-04-08 23:58:24

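Of course it is possible. The ugly way is to read the file line by line in text mode. Then use a regular expression, or just simple string search methods (for example with the keywords <title> and </title>), as a filter to get the lines of the form

<title>AccessibleComputing</title>

Then you can take the title and do whatever you want with it.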
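A rough sketch of that line-scanning idea follows. The function name titles_by_line_scan is made up, and the sketch assumes that every <title>...</title> pair sits on a single line, as it does in the excerpt from the question. It does not decode XML entities such as &amp;amp; and it is not a real XML parser.

```python
import re

# Assumption: opening and closing title tags are always on the same line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def titles_by_line_scan(lines):
    """Yield the raw text between <title> and </title> for matching lines."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            yield match.group(1)
```

Since an open text file iterates line by line, it could be used like this: `with open('file.xml', encoding='utf-8') as f:` followed by a loop over `titles_by_line_scan(f)`.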

Source

2013-04-08 23:49:46 Sheng

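The pitfalls of parsing XML with regular expressions are countless, especially with content from Wikipedia; I bet you would run into some of them. – 2013-04-08 23:56:44

Yes. That is why I called it the ugly way. Still, though I am not entirely sure, I think it can work if the regex is good enough. After all, XML is text-based. But your approach is better. – Sheng 2013-04-09 00:03:18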

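I haven't tried it on a data set this large, but I have found the lxml module to be fast and useful.

The lxml.etree tutorial has an example that may be instructive.

The key paragraph is:

A very important use case for iterparse() is parsing large generated XML files, e.g. database dumps. Most often, these XML formats only have one main data item element that hangs directly below the root node and that is repeated thousands of times. In this case, it is best practice to let lxml.etree do the tree building and only to intercept on exactly this one element, using the normal tree API for data extraction.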
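A minimal sketch of that pattern is shown here with the standard library's xml.etree.ElementTree.iterparse, swapped in for lxml so the example has no third-party dependency. lxml.etree.iterparse is used in much the same way. The function name iter_titles is made up, and the namespace string is taken from the xmlns attribute in the question's excerpt, so it would need adjusting for other dump versions.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace copied from the xmlns attribute in the dump excerpt.
NS = "{http://www.mediawiki.org/xml/export-0.8/}"


def iter_titles(source):
    """Yield each page title as soon as its <page> element is complete."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == NS + "page":
            # The page subtree is fully built here; use the normal tree API.
            yield elem.findtext(NS + "title")
            # Drop finished children of the root so memory stays bounded.
            root.clear()
```

Looping over iter_titles('file.xml') and printing each item should produce titles while the rest of the file is still being read. Clearing the root after each page is what keeps the partially built tree from growing with the size of the dump.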

Source

2013-04-09 01:59:19 bbayles
