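Improving the accuracy of a Naive Bayes classifier
I wrote a simple document classifier and I am currently testing it on the Brown corpus. However, my accuracy is still very low (0.16). I have already excluded stopwords. Any other ideas on how to improve the classifier's performance?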
import nltk, random
from nltk.corpus import brown, stopwords
documents = [(list(brown.words(fileid)), category)
             for category in brown.categories()
             for fileid in brown.fileids(category)]
random.shuffle(documents)
stop = set(stopwords.words('english'))
all_words = nltk.FreqDist(w.lower() for w in brown.words() if w in stop)
word_features = list(all_words.keys())[:3000]
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
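I think there is a problem with the code formatting: it looks like two lines are commented out before classifier = nltk... is called. By the way, this is not using a Naive Bayes classifier but a decision tree classifier, so you should change the tags and the title.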
You are not excluding stopwords, you are including only them. Change: 'all_words = nltk.FreqDist(w.lower() for w in brown.words() if w in stop)' to 'all_words = nltk.FreqDist(w.lower() for w in brown.words() if w not in stop)'
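Building on the fix suggested in the comment above, here is a minimal sketch of the vocabulary and feature steps with the stopword test inverted. It also makes three changes that are my own assumptions and not part of the original post: tokens are lowercased before the stopword test (the NLTK English stopword list is lowercase, so a capitalized "The" would otherwise slip through), the vocabulary is taken with most_common() because slicing the keys of a frequency table does not necessarily return the most frequent words, and document words are lowercased so they match the lowercased features. The helper name top_content_words and its parameters are hypothetical, and collections.Counter stands in for nltk.FreqDist so the sketch runs without any corpus download.

```python
from collections import Counter


def top_content_words(words, stopwords, n=3000):
    """Return the n most frequent lowercased words that are NOT stopwords.

    Hypothetical helper: `words` is any iterable of tokens and `stopwords`
    is a set of lowercase words to exclude.
    """
    counts = Counter(
        w.lower() for w in words
        if w.lower() not in stopwords  # exclude stopwords instead of keeping only them
    )
    # most_common() sorts by frequency; slicing the keys would not.
    return [word for word, _ in counts.most_common(n)]


def document_features(document, word_features):
    """Map a tokenized document to boolean 'contains(word)' features."""
    # Lowercase here as well so the lookup matches the lowercased vocabulary.
    document_words = {w.lower() for w in document}
    return {'contains(%s)' % word: (word in document_words)
            for word in word_features}


# Toy data standing in for brown.words() and stopwords.words('english').
toy_words = ["The", "jury", "said", "the", "Jury", "praised", "the", "jury"]
toy_stop = {"the"}

vocab = top_content_words(toy_words, toy_stop, n=1)
features = document_features(["The", "Jury"], ["jury", "said"])
print(vocab)
print(features)
```

With the real data, the same functions would be called with brown.words() and set(stopwords.words('english')), and the resulting feature dictionaries passed to nltk.NaiveBayesClassifier.train as in the question. Whether this alone lifts the accuracy much above 0.16 is something to check on a held-out set.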