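sklearn - predicting the top 3-4 labels in multi-label classification of text documents

I currently have a MultinomialNB() classifier set up, with CountVectorizer extracting features from text documents. Although this works quite well, I would like to use the same approach to predict the top 3-4 labels, not just the single top label.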
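The main reason is that there are c. 90 labels and the input data is not very good, so the top prediction is only 35% accurate. If I could offer users the 3-4 most likely labels as suggestions, I could significantly increase the accuracy coverage.
Any suggestions? Any pointers would be appreciated!
The current code looks as follows: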
import numpy
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives here
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, accuracy_score

df = pd.read_csv("data/corpus.csv", sep=",", encoding="latin-1")
df = df.set_index('id')
df.columns = ['class', 'text']
# Shuffle the rows once before cross-validation
data = df.reindex(numpy.random.permutation(df.index))

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])

k_fold = KFold(n_splits=6, shuffle=True)
for train_indices, test_indices in k_fold.split(data):
    train_text = data.iloc[train_indices]['text'].values
    train_y = data.iloc[train_indices]['class'].values.astype(str)
    test_text = data.iloc[test_indices]['text'].values
    test_y = data.iloc[test_indices]['class'].values.astype(str)

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)

    # Per-fold confusion matrix and top-1 accuracy
    confusion = confusion_matrix(test_y, predictions)
    accuracy = accuracy_score(test_y, predictions)
    print(accuracy)
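A minimal sketch of one way to get the 3-4 most likely labels from the same kind of pipeline, assuming the classifier exposes predict_proba (MultinomialNB does). The helper names top_k_labels and top_k_accuracy and the tiny corpus are made up for illustration; they are not part of the question's code or data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def top_k_labels(fitted_pipeline, texts, k=3):
    """Return an (n_samples, k) array of labels, most probable first."""
    probabilities = fitted_pipeline.predict_proba(texts)
    # Columns of predict_proba follow the final estimator's classes_ order
    classes = fitted_pipeline.steps[-1][1].classes_
    # argsort is ascending, so take the last k columns and reverse them
    top_indices = np.argsort(probabilities, axis=1)[:, -k:][:, ::-1]
    return classes[top_indices]


def top_k_accuracy(true_labels, suggestions):
    """Fraction of samples whose true label is among the suggested labels."""
    hits = [label in row for label, row in zip(true_labels, suggestions)]
    return float(np.mean(hits))


# Hypothetical toy corpus, purely for illustration
train_text = [
    "cheap flights to paris", "hotel booking in rome",
    "python pandas dataframe", "java compile error",
    "chicken soup recipe", "bake chocolate cake",
    "football match score", "tennis final result",
]
train_y = ["travel", "travel", "code", "code",
           "food", "food", "sport", "sport"]

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB()),
])
pipeline.fit(train_text, train_y)

test_text = ["python compile error"]
test_y = ["code"]
suggestions = top_k_labels(pipeline, test_text, k=3)
print(suggestions)
print(top_k_accuracy(test_y, suggestions))
```

Inside the cross-validation loop, the same two helpers could be called on each fold's test_text and test_y to report how often the true label appears among the suggestions, alongside the usual top-1 accuracy.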
Great - didn't know it would be this simple... – koend