lxml and fast_iter eating all the memory

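I want to parse a 1.6 GB XML file with Python (2.7.2) using lxml (3.2.0) on OS X (10.8.2). Because I had already read about potential memory consumption problems, I use fast_iter, but after the main loop it has consumed about 8 GB of RAM, even though it does not keep any data from the actual XML file.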

from lxml import etree 

def fast_iter(context, func, *args, **kwargs): 
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ 
    # Author: Liza Daly 
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context 

def process_element(elem): 
    pass 

context = etree.iterparse("sachsen-latest.osm", tag="node", events=("end",)) 
fast_iter(context, process_element)

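I don't understand why there is such a massive leak: the element and the whole context are deleted in fast_iter(), and at the moment I am not even processing the XML data.

Any ideas?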

Source

2013-05-10 Thomas Skowron

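Using this approach to parse the ODP data (using [an older answer of mine](http://stackoverflow.com/questions/16355421/how-to-extract-information-from-odp-accurately/16355498#16355498)) works just fine for me; I see no more than 9MB of real memory, 21.3 virtual memory. – 2013-05-10 12:54:28

I ran into an XML char ref error after parsing more than 3.2 million entries, but my 'import gc; len(gc.get_objects())' test showed hardly any change in the number of tracked objects, so I am not seeing any leak myself. – 2013-05-10 12:56:40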

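The problem is the behaviour of etree.iterparse(). You would think it only uses memory for each node element, but it turns out that it still keeps all the other elements in memory. Since you never clear them, memory eventually blows up, especially when parsing .osm (OpenStreetMap) files and looking for nodes, but more on that later.

The solution I found was not to catch only the node tag, but to catch all tags: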
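A minimal sketch of this behaviour, using a tiny in-memory document instead of a real .osm file. The sample data and the helper names clear_processed and remaining_children are made up for illustration; only etree.iterparse, its tag and events arguments, and its root attribute come from lxml. It compares what is left under the root after a tag-filtered pass and after an unfiltered pass.

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for an .osm file: nodes first, then ways and relations.
SAMPLE = b"""<osm>
  <node id="1"/>
  <node id="2"/>
  <way id="10"><nd ref="1"/><nd ref="2"/></way>
  <way id="11"><nd ref="2"/></way>
  <relation id="20"/>
</osm>"""


def clear_processed(elem):
    """Empty elem and drop the siblings that precede it."""
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]


def remaining_children(tag=None):
    """Parse SAMPLE, clear every reported element, return the tags left under the root."""
    context = etree.iterparse(BytesIO(SAMPLE), events=("end",), tag=tag)
    for _event, elem in context:
        clear_processed(elem)
    return [child.tag for child in context.root]


filtered = remaining_children(tag="node")  # only node elements are reported
unfiltered = remaining_children()          # every element is reported
print(filtered)
print(unfiltered)
```

In the filtered pass the way and relation elements are never handed to the loop, so nothing clears them and they stay attached to the root. On a 1.6 GB extract that is presumably where the memory goes.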

context = etree.iterparse(open(filename, 'rb'), events=('end',))

and then clear all the tags, but only process the ones you are interested in:

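# progress.bar is an optional progress-bar wrapper around the iterator
for event, elem in progress.bar(context):
    if elem.tag == 'node':
        pass  # do things here

    # clear every element, whether or not it was processed
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context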

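Keep in mind that this may delete other elements you are interested in, so make sure to add more ifs where needed. For example (and this is .osm specific), for the tags nested inside nodes: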

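for event, elem in context:
    if elem.tag == 'tag':
        continue  # keep nested tags until their parent has been handled
    if elem.tag == 'node':
        for tag in elem.iterchildren():
            pass  # do stuff
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]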
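To make the nested-tag case concrete, here is a small sketch that collects each node's id together with its k/v tags while still clearing every element. The function name iter_nodes and the sample data are hypothetical. The k and v attribute names follow the usual .osm layout, which is an assumption about your file.

```python
from io import BytesIO
from lxml import etree


def iter_nodes(source):
    """Yield (node id, {k: v}) for each node while keeping the tree small."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "tag":
            # Leave nested tags alone; the enclosing element reads them later.
            continue
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.iterchildren("tag")}
            yield elem.get("id"), tags
        # Clearing the element also drops its children, including any tags.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]


# Hypothetical sample data, shaped like a very small .osm extract.
sample = b"""<osm>
  <node id="1" lat="51.0" lon="13.7"><tag k="amenity" v="cafe"/></node>
  <node id="2" lat="51.1" lon="13.8"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="path"/></way>
</osm>"""

nodes = list(iter_nodes(BytesIO(sample)))
print(nodes)
```

Skipping the tag elements only delays their removal: they go away when the enclosing node or way is cleared at its own end event.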

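The reason why memory blows up later is rather interesting: .osm files are organized so that nodes come first, then ways, then relations. So your code handles the nodes at the beginning just fine, and then memory fills up as etree goes through the rest of the elements.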

Source

2014-08-05 09:23:56
