How to get a streaming Iterator[Node] from a large XML document?

I need to process XML documents that consist of a very large number of independent records, e.g.:

<employees> 
    <employee> 
     <firstName>Kermit</firstName> 
     <lastName>Frog</lastName> 
     <role>Singer</role> 
    </employee> 
    <employee> 
     <firstName>Oscar</firstName> 
     <lastName>Grouch</lastName> 
     <role>Garbageman</role> 
    </employee> 
    ... 
</employees>

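In some cases these are just big files, but in others they may come from a streaming source.

I can't just scala.xml.XML.load() it, because I don't want to hold the whole document in memory (or wait for the input stream to close) when I only need to work with one record at a time. I know I can use XMLEventReader to stream the input as a sequence of XMLEvents. These are, however, much less convenient to work with than scala.xml.Node.

So I'd like to get a lazy Iterator[Node] out of this somehow, in order to operate on each individual record using the convenient Scala syntax while keeping memory usage under control.

To do it myself, I could start with an XMLEventReader, build up a buffer of the events between each matching start and end tag, and then construct a Node tree from that. But is there an easier way that I've overlooked? Thanks for any insights!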
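For reference, the do-it-yourself approach described above might look roughly like the sketch below. It is only an illustration under several assumptions: it swaps in the JDK's StAX reader (javax.xml.stream) for scala.xml.pull.XMLEventReader, it assumes the scala-xml module is on the classpath, and `recordIterator` is a hypothetical name. It also drops attributes, namespaces, comments, and whitespace-only text to stay short.

```scala
import java.io.{Reader, StringReader}
import javax.xml.stream.XMLInputFactory
import scala.collection.mutable.ListBuffer
import scala.xml.{Elem, Node, Null, Text, TopScope}

// Hypothetical helper: lazily yields one Node per element named `recordLabel`.
// Only the record currently being built is held in memory.
def recordIterator(reader: Reader, recordLabel: String): Iterator[Node] = {
  val events = XMLInputFactory.newInstance().createXMLEventReader(reader)

  // Builds an Elem for a start tag that has just been consumed, reading
  // events up to and including the matching end tag.
  // Simplification: attributes and namespaces are ignored.
  def build(label: String): Elem = {
    val children = ListBuffer.empty[Node]
    var closed = false
    while (!closed) {
      val ev = events.nextEvent()
      if (ev.isStartElement())
        children += build(ev.asStartElement().getName().getLocalPart())
      else if (ev.isCharacters() && !ev.asCharacters().isWhiteSpace())
        children += Text(ev.asCharacters().getData())
      else if (ev.isEndElement())
        closed = true
    }
    Elem(null, label, Null, TopScope, true, children.toList: _*)
  }

  new Iterator[Node] {
    private var pending: Option[Node] = None

    // Skips events until the next record start tag, then builds that record.
    private def advance(): Unit =
      while (pending.isEmpty && events.hasNext()) {
        val ev = events.nextEvent()
        if (ev.isStartElement() &&
            ev.asStartElement().getName().getLocalPart() == recordLabel)
          pending = Some(build(recordLabel))
      }

    def hasNext: Boolean = { advance(); pending.isDefined }

    def next(): Node = {
      advance()
      val node = pending.getOrElse(throw new NoSuchElementException("no more records"))
      pending = None
      node
    }
  }
}

// Usage in the REPL or a script, with a small in-memory document
val sample =
  "<employees><employee><firstName>Kermit</firstName></employee>" +
  "<employee><firstName>Oscar</firstName></employee></employees>"

recordIterator(new StringReader(sample), "employee")
  .foreach(e => println((e \ "firstName").text))
```

This stays single-threaded and pull-based, at the cost of rebuilding the Node tree by hand.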


2011-12-15 David Soergel

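You can reuse the underlying parser used by XMLEventReader through ConstructingParser and process your employee nodes below the top level with a callback. You just have to be careful to discard the data as soon as it has been processed: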

import scala.io.Source
import scala.xml._

// Note: written against the Scala 2.9-era API; later scala-xml releases
// add an `empty: Boolean` parameter to `elem`.
def processSource[T](input: Source)(f: NodeSeq => T): Unit = {
  new scala.xml.parsing.ConstructingParser(input, false) {
    var depth = 0 // current nesting depth, declared before parsing starts

    nextch   // initialize per documentation
    document // trigger parsing by requesting document

    override def elemStart(pos: Int, pre: String, label: String,
        attrs: MetaData, scope: NamespaceBinding): Unit = {
      super.elemStart(pos, pre, label, attrs, scope)
      depth += 1
    }
    override def elemEnd(pos: Int, pre: String, label: String): Unit = {
      depth -= 1
      super.elemEnd(pos, pre, label)
    }
    // Called once an element's children have been parsed, before elemEnd,
    // so depth still counts the element itself.
    override def elem(pos: Int, pre: String, label: String, attrs: MetaData,
        pscope: NamespaceBinding, nodes: NodeSeq): NodeSeq = {
      val node = super.elem(pos, pre, label, attrs, pscope, nodes)
      depth match {
        case 1 => <dummy/>                // dummy final roll up
        case 2 => f(node); NodeSeq.Empty  // process and discard employee nodes
        case _ => node                    // roll up other nodes
      }
    }
  }
}

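You then use it like this to process each second-level node in constant memory (assuming the second-level nodes do not themselves have an arbitrary number of children):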

processSource(src){ node => 
    // process here 
    println(node) 
}

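The benefit compared to XMLEventReader is that you do not use two threads. You also do not have to parse each node twice, as you would with your proposed solution. The drawback is that this relies on the inner workings of ConstructingParser.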


2011-12-16 04:31:51 huynhjl

Brilliant! This works great. It's not too hard to get from this generator-style thing to an iterator; see my other answer. Thanks very much! – 2011-12-16 18:21:04

To get from huynhjl's generator solution to a TraversableOnce[Node], use this trick:

def generatorToTraversable[T](func: (T => Unit) => Unit) =
  new Traversable[T] {
    // Each foreach call re-runs the generator, passing f as the callback.
    def foreach[X](f: T => X): Unit = {
      func(f(_))
    }
  }

def firstLevelNodes(input: Source): TraversableOnce[Node] =
  generatorToTraversable(processSource(input))

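The result of generatorToTraversable cannot be traversed more than once (even though a new ConstructingParser is instantiated on each foreach call), because the input stream is a Source, which is an Iterator. We can't override Traversable.isTraversableAgain, though, because it is final.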
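A small illustration of why a second traversal finds nothing: scala.io.Source is an Iterator[Char], so whatever reads it first consumes it. The names below are only for demonstration.

```scala
import scala.io.Source

// A Source is a one-shot Iterator[Char]: reading it consumes it.
val src = Source.fromString("<employees/>")

val firstPass  = src.mkString // reads every character
val secondPass = src.mkString // nothing is left to read

println(firstPass)
println(secondPass.isEmpty)
```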

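Really, we would like to enforce this by simply returning an Iterator. However, both Traversable.toIterator and Traversable.view.toIterator create an intermediate Stream, which caches all the entries (defeating the whole purpose of this exercise). Oh well; I'll just let the stream throw an exception if it is accessed twice.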
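If relying on the input stream to fail feels too implicit, the single-use rule could also be enforced explicitly. The sketch below is one possible way, not what the answer actually does: `OneShot` is a hypothetical foreach-only wrapper (not part of the Scala library) that throws on a second traversal, and the toy generator stands in for processSource(input).

```scala
// Hypothetical wrapper: exposes only foreach over a generator and
// refuses to run it a second time. Not thread safe.
final class OneShot[T](generator: (T => Unit) => Unit) {
  private var consumed = false

  def foreach[U](f: T => U): Unit = {
    if (consumed)
      throw new IllegalStateException("this generator can only be traversed once")
    consumed = true
    generator { t => f(t); () } // discard f's result to match T => Unit
  }
}

// Usage in the REPL or a script, with a toy generator
val numbers = new OneShot[Int](emit => (1 to 3).foreach(emit))
for (n <- numbers) println(n) // a for loop only needs foreach
```

Because the wrapper defines only foreach, nothing is buffered between the generator and the consumer.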

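Also note that the whole thing is not thread safe.

This code runs great, and I believe the overall solution is both lazy and non-caching (hence constant memory), though I haven't tried it on a large input yet.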


2011-12-16 18:33:37

I didn't know this awesome trick! – huynhjl 2011-12-17 02:37:52
