2010-04-06 148 views
2

Python memory leak - solved, but still puzzled

I have successfully debugged my own memory leak problem. However, I noticed some very strange behaviour.

for fid, fv in freqDic.iteritems():
    outf.write(fid + "\t")  # document ID
    for i, term in enumerate(domain):  # TF-IDF vector, one value per term
        tfidf = self.tf(term, fv) * self.idf(term, docFreqDic)
        if i == len(domain) - 1:
            outf.write("%f\n" % tfidf)
        else:
            outf.write("%f\t" % tfidf)
    outf.flush()
    print "Memory increased by", int(self.memory_mon.usage()) - startMemory

outf.close()

def tf(self, term, freqVector):
    total = freqVector[TOTAL]
    if total == 0:
        return 0
    if term not in freqVector:  ## Without these two lines,
        return 0                ## memory usage keeps growing
    return float(freqVector[term]) / freqVector[TOTAL]


def idf(self, term, docFrequencyPerTerm):
    if term not in docFrequencyPerTerm:
        return 0
    return math.log(float(docFrequencyPerTerm[TOTAL]) / docFrequencyPerTerm[term])

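Let me describe my problem:
1) I am doing a TF-IDF calculation.
2) I traced the source of my memory leak to a defaultdict.
3) I am using memory_mon from "How to get current CPU and RAM usage in Python?"
4) The cause of my memory leak is as follows: a) in self.tf, if the lines `if term not in freqVector: return 0` are left out, memory leaks. (I verified this with memory_mon and saw memory usage climb sharply and continuously.)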

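The explanation for my problem is: 1) since fv is a defaultdict, every lookup of a key that is not in fv creates an entry for it. With a very large domain, this causes the memory growth.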
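The behaviour described above can be reproduced in a few lines. This is a minimal sketch in Python 3 syntax (the question's own code is Python 2); the dictionary contents and the term names are made up for illustration and are not the asker's data.

```python
from collections import defaultdict

# Hypothetical frequency vector for one document; the real fv comes from freqDic.
fv = defaultdict(int, {"TOTAL": 3, "apple": 2, "pear": 1})
domain = ["apple", "pear", "kiwi", "plum"]  # made-up vocabulary

size_before = len(fv)

# Plain indexing on a defaultdict inserts every missing key with the default value.
for term in domain:
    _ = fv[term]
size_after_indexing = len(fv)

# A membership test or .get() reads without inserting anything.
fv2 = defaultdict(int, {"TOTAL": 3, "apple": 2, "pear": 1})
for term in domain:
    _ = fv2[term] if term in fv2 else 0
    _ = fv2.get(term, 0)
size_after_guarded = len(fv2)

print(size_before, size_after_indexing, size_after_guarded)  # prints: 3 5 3
```

With a domain of many thousands of terms, each document's vector would grow by one entry per term it does not contain, which is consistent with the steady growth reported by memory_mon.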

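I decided to use a dict instead of a defaultdict, and the memory problem did go away. My only puzzle is this: since fv is created in `for fid, fv in freqDic.iteritems():`, shouldn't fv be destroyed at the end of each iteration of the for loop? I tried putting gc.collect() at the end of the for loop, but gc could not collect anything (it returned 0). Yes, that assumption may be correct, but if the for loop destroys all of its temporary variables, memory usage should stay roughly constant from one iteration to the next.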

This is what the output looks like with the two lines in self.tf:

Memory increased by 12 
Memory increased by 948 
Memory increased by 28 
Memory increased by 36 
Memory increased by 36 
Memory increased by 32 
Memory increased by 28 
Memory increased by 32 
Memory increased by 32 
Memory increased by 32 
Memory increased by 40 
Memory increased by 32 
Memory increased by 32 
Memory increased by 28 

And without the two lines:

Memory increased by 1652 
Memory increased by 3576 
Memory increased by 4220 
Memory increased by 5760 
Memory increased by 7296 
Memory increased by 8840 
Memory increased by 10456 
Memory increased by 12824 
Memory increased by 13460 
Memory increased by 15000 
Memory increased by 17448 
Memory increased by 18084 
Memory increased by 19628 
Memory increased by 22080 
Memory increased by 22708 
Memory increased by 24248 
Memory increased by 26704 
Memory increased by 27332 
Memory increased by 28864 
Memory increased by 30404 
Memory increased by 32856 
Memory increased by 33552 
Memory increased by 35024 
Memory increased by 36564 
Memory increased by 39016 
Memory increased by 39924 
Memory increased by 42104 
Memory increased by 42724 
Memory increased by 44268 
Memory increased by 46720 
Memory increased by 47352 
Memory increased by 48952 
Memory increased by 50428 
Memory increased by 51964 
Memory increased by 53508 
Memory increased by 55960 
Memory increased by 56584 
Memory increased by 58404 
Memory increased by 59668 
Memory increased by 61208 
Memory increased by 62744 
Memory increased by 64400 

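I look forward to your answers.

Edit: It seems my terminology may be wrong (or may appear wrong).

  1. The memory leak I am referring to is not the one produced by freqVector[term] (looking up a nonexistent key in a defaultdict).
  2. The actual memory leak I am talking about comes from `for fid, fv in freqDic.iteritems()`! I know that fv grows in size because of 1), but it should still be destroyed at the end of the loop! Memory should not keep expanding. Isn't this a memory leak?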

Answers

2

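Iterating over freqDic does not create new values; it hands you references to the values the dict already holds. This means that the new entries you add to fv are still held by freqDic, even after the loop.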

Another solution is to clear freqDic once the loop has finished.

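In general, Python passes references to objects everywhere, even if it sometimes appears otherwise. Strings and integers are immutable: when you "change" one, a new object is created and the name is rebound to it, rather than the original object being modified.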
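A short sketch of the point about references, written in Python 3 syntax (the question uses Python 2, where `.items()` would be `.iteritems()`). The document IDs and counts are invented for illustration.

```python
import gc
from collections import defaultdict

# Hypothetical stand-in for freqDic: document ID -> frequency vector.
freqDic = {"doc1": defaultdict(int, {"apple": 2}),
           "doc2": defaultdict(int, {"pear": 1})}

for fid, fv in freqDic.items():
    assert fv is freqDic[fid]   # the loop variable is the stored object, not a copy
    _ = fv["kiwi"]              # missing key: the lookup inserts it into that object

# fv was only a name; the vectors (including the new entries) still belong to freqDic,
# so the garbage collector has no reason to reclaim them.
gc.collect()
sizes_after_loop = {fid: len(v) for fid, v in freqDic.items()}

# Only when nothing references the vectors any more can they be freed.
freqDic.clear()
del fv
entries_after_clear = len(freqDic)

print(sizes_after_loop, entries_after_clear)  # prints: {'doc1': 2, 'doc2': 2} 0
```

This is why the end of each loop iteration does not shrink memory: rebinding fv to the next value drops one reference, but freqDic keeps its own reference to every vector.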

+0

Thanks. That makes sense. – disappearedng 2010-04-06 15:14:13

0

This is not a memory leak, because no memory is being leaked; it is being used by your defaultdict. For example:

from collections import defaultdict 

d = defaultdict(int) 
for i in xrange(10**7): 
    a = d[i] 

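Do you think this is a memory leak? You are assigning values to a dictionary, and memory usage increases because of it, so it is equivalent to this: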

d = {} 
for i in xrange(10**7): 
    d[i] = 0 

That is not a memory leak.
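To check this kind of claim with numbers rather than by eye, one option is the standard-library `tracemalloc` module (Python 3.4 and later, so not available to the Python 2 code in the question). The sketch below is an assumption about how one might measure it, not the asker's memory_mon; the helper name `traced_growth` is hypothetical, and N is reduced from 10**7 so it runs quickly.

```python
import tracemalloc
from collections import defaultdict

N = 10**5  # smaller than the 10**7 used above, to keep the run short


def traced_growth(lookup):
    """Return (bytes still allocated after N lookups, number of keys stored)."""
    d = defaultdict(int)
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for i in range(N):
        lookup(d, i)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before, len(d)


# Indexing inserts every missing key; .get() only reads.
grew_index, keys_index = traced_growth(lambda d, i: d[i])
grew_get, keys_get = traced_growth(lambda d, i: d.get(i, 0))

print("d[i]:       ", keys_index, "keys,", grew_index, "bytes")
print("d.get(i, 0):", keys_get, "keys,", grew_get, "bytes")
```

The exact byte counts depend on the interpreter version and platform, so none are quoted here; what should hold is that the indexing variant ends with N keys and far more retained memory than the `.get()` variant, which stores nothing.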

+0

Please read my edit. – disappearedng 2010-04-06 15:13:15

1

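I suspect Python's memory usage may be increasing because floats are also objects in Python, and the interpreter maintains an unbounded and immortal freelist of floats. So whenever a float computation produces a new float that has not occurred before, Python allocates a new float object in the freelist, and it then keeps that object around in case it is needed later.

See a similar discussion in the Python bug tracker here.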