NLTK Naive Bayes classifier memory problem

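My first post here! I have a problem using the nltk NaiveBayesClassifier. I have a training set of 7000 items. Each training item has a description of 2 or 3 words and a code. I want to use the code as the class label and each word of the description as a feature. An example:

"My name is Obama", 001 ...

Training set = {[feature['My'] = True, feature['name'] = True, feature['is'] = True, feature['Obama'] = True], 001}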
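For concreteness, the example above could be written in Python roughly as follows. This is only a minimal sketch: the helper name `make_labeled_featureset` is hypothetical, and splitting the description on whitespace is an assumption.

```python
def make_labeled_featureset(description, code):
    """Turn one training item into a (features, label) pair.

    Assumption: the words of a description are separated by whitespace.
    """
    # One boolean feature per distinct word in the description.
    features = {word: True for word in description.split()}
    return (features, code)


example = make_labeled_featureset("My name is Obama", "001")
print(example)
```

A list of such pairs is the shape of input that NaiveBayesClassifier.train is given in the code further down.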

Unfortunately, with this approach the training procedure NaiveBayesClassifier.train uses up to 3 GB of RAM. What is wrong with my approach? Thanks!

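from nltk import classify, NaiveBayesClassifier

def document_features(document):  # feature extractor
    document = set(document)  # drop duplicate words
    return dict((w, True) for w in document)  # one boolean feature per word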

... 
words = set()
entries = []
train_set = []
train_length = 2000
readfile = open("atcname.pl", 'r')
t = readfile.readline()
while t != "":
    t = t.split("'")
    code = t[0]  # class
    desc = t[1]  # description
    s = desc.split()  # assumption: words are whitespace-separated
    words = words.union(s)  # update dictionary with the new words in the description
    entries.append((s, code))
    t = readfile.readline()
readfile.close()
train_set = classify.util.apply_features(document_features, entries[:train_length])
classifier = NaiveBayesClassifier.train(train_set)  # Training

2012-03-15 Marco

Use nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory.

from nltk.classify import apply_features

More information and examples here.
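To illustrate what "acts like a list but does not store all the feature sets" means, here is a rough pure-Python sketch of the idea. It is not NLTK's implementation; the class name `LazyFeatureList` is hypothetical, and the sample entry is made up from the example in the question.

```python
from collections.abc import Sequence


class LazyFeatureList(Sequence):
    """List-like view that builds each feature set on access.

    Illustrative only: this is not how NLTK implements apply_features.
    """

    def __init__(self, feature_func, labeled_tokens):
        self._feature_func = feature_func
        self._labeled_tokens = labeled_tokens

    def __len__(self):
        return len(self._labeled_tokens)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns another lazy view over the sliced tokens.
            return LazyFeatureList(self._feature_func, self._labeled_tokens[index])
        token, label = self._labeled_tokens[index]
        # The feature dict is computed here, on demand, and is not cached.
        return (self._feature_func(token), label)


def document_features(document):
    """Feature extractor from the question: one boolean feature per word."""
    return {w: True for w in set(document)}


entries = [(["My", "name", "is", "Obama"], "001")]
lazy_train_set = LazyFeatureList(document_features, entries)
print(len(lazy_train_set), lazy_train_set[0])
```

With the real library the corresponding call would be apply_features(document_features, entries). Note that laziness of this kind only avoids keeping every feature dict alive at once; whatever tables the classifier builds during training are still held in memory.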

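You are loading the whole file into memory anyway, so you will need some form of lazy loading, which reads data only as it is needed. Take a look at this.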
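One possible way to read the file lazily is a generator that yields one entry per line instead of building the full entries list first. This is a sketch under assumptions: the function name `iter_entries` is hypothetical, the line layout (code, then a quoted description) is inferred from the split("'") in the question, and words are assumed to be whitespace-separated.

```python
def iter_entries(path):
    """Yield (words, code) pairs one line at a time.

    Assumptions: each line has the form  code'description'...  and the
    words of the description are separated by whitespace.
    """
    with open(path, "r") as handle:
        for line in handle:
            parts = line.split("'")
            if len(parts) < 2:
                continue  # skip lines that do not match the assumed layout
            code, desc = parts[0], parts[1]
            yield (desc.split(), code)
```

It would be used as `for words, code in iter_entries("atcname.pl"): ...`. A generator has no length and cannot be sliced, so taking the first N entries needs something like itertools.islice rather than entries[:N].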

2012-03-15 17:18:56 subiet

Thanks for the suggestion! I tried it, but I saw no improvement in memory usage. Using train_set = classify.util.apply_features(document_features, entries[:1500]), with only 1500 items, I use 1.7 GB.... – Marco 2012-03-15 18:27:17

Could you post a gist of your training set and the exact syntax you tried? apply_features usually works well. – subiet 2012-03-16 07:34:45

Thanks.. updated.. – Marco 2012-03-16 11:16:17
