概率分析器的內存使用情況

我正在爲範圍連接語法寫一個CKY分析器。我想用樹庫作爲語法，所以語法會很大。我用Python編寫了一個原型1，當我模擬幾十個句子的樹庫時，它似乎工作的很好，但內存使用情況是不可接受的。我試着用C++編寫它，但到目前爲止，一直非常令人沮喪，因爲我之前從未使用過C++。下面是一些數據（n是句子中的語法是基於數）：概率分析器的內存使用情況

n mem 
9 173M 
18 486M 
36 836M

這種增長模式是什麼，是得到最好優先算法可以預期的，但開銷的金額是讓我憂心忡忡。根據heapy的內存使用情況是比這些數字小10倍的因素，valgrind報告了類似的情況。導致這種差異的原因是什麼，我可以在Python（或Cython）中做些什麼？也許這是由於碎片化？或者，也許這是python字典的開銷？

一些背景：兩個重要的數據結構是將概率映射到邊界的議程，以及圖表，這是一個將非終結符和位置映射到邊界的字典。議程是通過heapdict（內部使用dict和heapq列表）來實現的，圖表包含一個將非終止符和位置映射到邊界的字典。議程經常插入並從中刪除，圖表只能獲取插入和查找。我表示與元組邊緣是這樣的：

(("S", 111), ("NP", 010), ("VP", 100, 001))

字符串是從語法的非終結標籤，位置被編碼爲位掩碼。當組分不連續時，可以有多個位置。所以這個邊緣可以表示對「是瑪麗開心」的分析，其中「是」和「開心」都屬於VP。圖表字典由該邊的第一個元素索引（「S」，111）。情況在一個新的版本我想，希望調換這表示，這將節省內存由於重用：

(("S", "NP", "VP), (111, 100, 011))

我想通了Python將存儲的第一部分只有一次，如果它會發生在不同的位置組合，雖然我不確定這是真的，但無論如何，它似乎沒有任何區別

所以基本上我想知道的是，是否值得追求我的Python實現，包括做Cython的東西和不同的數據結構，或者從頭開始編寫C++是唯一可行的選擇。

UPDATE：經過一些改進後，我不再有內存使用問題。我正在研究優化的Cython版本。爲了提高代碼的效率，我會獎賞最有用的建議。有一個註釋版本在http://student.science.uva.nl/~acranenb/plcfrs_cython.html

1 https://github.com/andreasvc/disco-dop/ - 運行test.py來解析一些句子。需要Python 2.6，nltk和heapdict

來源

2011-03-21 Andreas

我想通了Python將存儲的第一部分，只有一次它是否會與不同的位置

不一定發生在組合：

>>> ("S", "NP", "VP") is ("S", "NP", "VP") 
False

你可能需要intern所有字符串指向非終端，因爲您似乎在rcgrules.py中創建了很多這些字符串。如果你想intern元組，然後把它變成一個字符串第一：

>>> intern("S NP VP") is intern(' '.join('S', 'NP', 'VP')) 
True

否則，你將不得不「複製」的元組，而不是重新構建他們。如果你不熟悉C++，那麼在其中重寫這樣的算法不太可能提供很大的內存優勢，你必須首先評估各種哈希表實現並瞭解其容器中的複製行爲。我發現boost::unordered_map與很多小散列表相當浪費）。

來源

2011-03-21 16:06:18

這是一個不幸的缺點，需要在很多地方分裂。我希望有一個相當於元組的實習生。我已經在我當前的代碼中插入了非終結符。單獨的rcgules.py的內存使用率分別爲67M，78M和100M，因此它不是罪魁禍首。 – Andreas 2011-03-21 16:42:26

@Andreas：因爲這是CKY的變體，你需要明確的議程嗎？（我無法順便運行你的代碼，因爲'dopg.py'丟失了。你的網站上的這個版本不適用於'disco-dop'。） – 2011-03-21 16:51:08

dopg.py在版本庫eodop中，也在我的github上。在其他語法分析器中，您可以從左到右瀏覽句子，但是這種形式主義支持不可能的不連續性和詞序變化，所以我認爲需要議程。無論哪種方式，我直接從出版物中實施這個解析器，儘管他們也討論了有助於的啓發式。但是，我確信由於Python中的某些內容，我的開銷有問題，我只是不知道該怎麼做。我通過將它們存儲在全局字典中來實現元組實習功能，但它沒有幫助。 – Andreas 2011-03-21 17:34:16

您是否嘗試過使用PyPy而不是CPython來運行您的應用程序？

PyPy是一個lot聰明比CPython關於注意共性和避免與不必要的重複相關的內存開銷。

這是值得一試，反正：http://pypy.org/

來源

2011-03-27 14:58:20 ncoghlan

是的，我做了，如果我沒有記錯的話，它消耗的內存是兩倍多。無論哪種方式，我發現了一些新的優化和內存不再是一個問題，但速度是。我正在用Cython嘗試更多的東西。 – Andreas 2011-03-27 19:09:35

對於像這樣的任務，pypy被證明比cpython更快，如果你有內存，那麼它就有時間去研究記憶。 – 2011-03-29 14:57:24

*這是memoisation，而不是記憶。就像拍一張備忘錄一樣。 – forivall 2012-12-18 22:11:45

首先要做到在這種情況下總是簡介：

15147/297 0.032 0.000 0.041 0.000 tree.py:102(__eq__) 
15400/200 0.031 0.000 0.106 0.001 tree.py:399(convert) 
     1 0.023 0.023 0.129 0.129 plcfrs_cython.pyx:52(parse) 
6701/1143 0.022 0.000 0.043 0.000 heapdict.py:45(_min_heapify) 
    18212 0.017 0.000 0.023 0.000 plcfrs_cython.pyx:38(__richcmp__) 
10975/10875 0.017 0.000 0.035 0.000 tree.py:75(__init__) 
    5772 0.016 0.000 0.050 0.000 tree.py:665(__init__) 
     960 0.016 0.000 0.025 0.000 plcfrs_cython.pyx:118(deduced_from) 
    46938 0.014 0.000 0.014 0.000 tree.py:708(_get_node) 
25220/2190 0.014 0.000 0.016 0.000 tree.py:231(subtrees) 
    10975 0.013 0.000 0.023 0.000 tree.py:60(__new__) 
    49441 0.013 0.000 0.013 0.000 {isinstance} 
    16748 0.008 0.000 0.015 0.000 {hasattr}

我首先注意到的是，很少功能是從用Cython模塊本身。其中大部分來自tree.py模塊，也許是瓶頸。

着眼於用Cython邊我看到richcmp功能：

我們可以簡單地通過在方法聲明中添加值的類型進行優化

def __richcmp__(ChartItem self, ChartItem other, int op): 
     ....

這使在價值

ncalls tottime percall cumtime percall filename:lineno(function) 
.... 
18212 0.011 0.000 0.015 0.000 plcfrs_cython.pyx:38(__richcmp__)

添加elif語法而不是單個if if將啓用the switch optimization of cython

if op == 0: return self.label < other.label or self.vec < other.vec 
    elif op == 1: return self.label <= other.label or self.vec <= other.vec 
    elif op == 2: return self.label == other.label and self.vec == other.vec 
    elif op == 3: return self.label != other.label or self.vec != other.vec 
    elif op == 4: return self.label > other.label or self.vec > other.vec 
    elif op == 5: return self.label >= other.label or self.vec >= other.vec

獲得：

17963 0.002 0.000 0.002 0.000 plcfrs_cython.pyx:38(__richcmp__)

試圖找出哪裏是tree.py:399轉換來自我發現裏面DOPG此功能。PY花費這麼長的時間

def removeids(tree): 
""" remove unique IDs introduced by the Goodman reduction """ 
result = Tree.convert(tree) 
for a in result.subtrees(lambda t: '@' in t.node): 
    a.node = a.node.rsplit('@', 1)[0] 
if isinstance(tree, ImmutableTree): return result.freeze() 
return result

現在我不知道，如果在樹中的每個節點是目的ChartItem如果的GetItem值正在其他地方使用，但加入這個變化：

cdef class ChartItem: 
cdef public str label 
cdef public str root 
cdef public long vec 
cdef int _hash 
__slots__ = ("label", "vec", "_hash") 
def __init__(ChartItem self, label, int vec): 
    self.label = intern(label) #.rsplit('@', 1)[0]) 
    self.root = intern(label.rsplit('@', 1)[0]) 
    self.vec = vec 
    self._hash = hash((self.label, self.vec)) 
def __hash__(self): 
    return self._hash 
def __richcmp__(ChartItem self, ChartItem other, int op): 
    if op == 0: return self.label < other.label or self.vec < other.vec 
    elif op == 1: return self.label <= other.label or self.vec <= other.vec 
    elif op == 2: return self.label == other.label and self.vec == other.vec 
    elif op == 3: return self.label != other.label or self.vec != other.vec 
    elif op == 4: return self.label > other.label or self.vec > other.vec 
    elif op == 5: return self.label >= other.label or self.vec >= other.vec 
def __getitem__(ChartItem self, int n): 
    if n == 0: return self.root 
    elif n == 1: return self.vec 
def __repr__(self): 
    #would need bitlen for proper padding 
    return "%s[%s]" % (self.label, bin(self.vec)[2:][::-1])

和mostprobableparse內：

from libc cimport pow 
def mostprobableparse... 
      ... 
    cdef dict parsetrees = <dict>defaultdict(float) 
    cdef float prob 
    m = 0 
    for n,(a,prob) in enumerate(derivations): 
     parsetrees[a] += pow(e,prob) 
     m += 1

我得到：

  189345 function calls (173785 primitive calls) in 0.162 seconds 

    Ordered by: internal time 

    ncalls tottime percall cumtime percall filename:lineno(function) 
6701/1143 0.025 0.000 0.037 0.000 heapdict.py:45(_min_heapify) 
     1 0.023 0.023 0.120 0.120 plcfrs_cython.pyx:54(parse) 
     960 0.018 0.000 0.030 0.000 plcfrs_cython.pyx:122(deduced_from) 
5190/198 0.011 0.000 0.015 0.000 tree.py:102(__eq__) 
    6619 0.006 0.000 0.006 0.000 heapdict.py:67(_swap) 
    9678 0.006 0.000 0.008 0.000 plcfrs_cython.pyx:137(concat)

所以接下來的步驟是優化heapify和deduced_from

deduce_from可以優化多一點：

cdef inline deduced_from(ChartItem Ih, double x, pyCx, pyunary, pylbinary, pyrbinary, int bitlen): 
cdef str I = Ih.label 
cdef int Ir = Ih.vec 
cdef list result = [] 
cdef dict Cx = <dict>pyCx 
cdef dict unary = <dict>pyunary 
cdef dict lbinary = <dict>pylbinary 
cdef dict rbinary = <dict>pyrbinary 
cdef ChartItem Ilh 
cdef double z 
cdef double y 
cdef ChartItem I1h 
for rule, z in unary[I]: 
    result.append((ChartItem(rule[0][0], Ir), ((x+z,z), (Ih,)))) 
for rule, z in lbinary[I]: 
    for I1h, y in Cx[rule[0][2]].items(): 
     if concat(rule[1], Ir, I1h.vec, bitlen): 
      result.append((ChartItem(rule[0][0], Ir^I1h.vec), ((x+y+z, z), (Ih, I1h)))) 
for rule, z in rbinary[I]: 
    for I1h, y in Cx[rule[0][1]].items(): 
     if concat(rule[1], I1h.vec, Ir, bitlen): 
      result.append((ChartItem(rule[0][0], I1h.vec^Ir), ((x+y+z, z), (I1h, Ih)))) 
return result

雖然我有信心，我們可以繼續優化的多我也就到此爲止對這個問題有了深入的瞭解。

一系列單元測試將有助於斷言每個優化不會引入任何細微的錯誤。

一個側面說明，嘗試使用空格而不是製表符。

來源

2011-03-30 12:09:32 fabrizioM

我曾想給你賞金，但現在它已經自動分配:(無論如何，你的建議改善了從7:59到7:44的運行時間，語法爲3600個句子，比我預期的要少一些。我打算使用內聯彙編來完成位操作，並且我應該用cdef類替換所有的元組。謝謝你的建議。 – Andreas 2011-03-30 21:45:35

概率分析器的內存使用情況

回答

相關問題