In LatentDirichletAllocation, the transform call returns the unnormalized document-topic distribution. To get proper probabilities, you can simply normalize the result. Here is an example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import numpy as np
# grab a sample data set
dataset = fetch_20newsgroups(shuffle=True, remove=('headers', 'footers', 'quotes'))
train, test = dataset.data[:100], dataset.data[100:200]
# vectorize the features
tf_vectorizer = TfidfVectorizer(max_features=25)
X_train = tf_vectorizer.fit_transform(train)
# train the model
lda = LatentDirichletAllocation(n_components=5)  # this parameter was called 'n_topics' in scikit-learn 0.18
lda.fit(X_train)
# predict topics for test data
# unnormalized doc-topic distribution
X_test = tf_vectorizer.transform(test)
doc_topic_dist_unnormalized = lda.transform(X_test)
# normalize the distribution (only needed if you want to work with the probabilities)
doc_topic_dist = doc_topic_dist_unnormalized / doc_topic_dist_unnormalized.sum(axis=1, keepdims=True)
To find the top-ranked topic for each document, you can do:
doc_topic_dist.argmax(axis=1)
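The normalize-then-argmax step can be checked in isolation. Below is a minimal sketch with a small made-up score matrix (the values are hypothetical, not output of the model above): row-normalizing makes each document's topic weights sum to 1, and argmax along axis 1 picks the highest-probability topic per document.

```python
import numpy as np

# hypothetical unnormalized doc-topic scores: 3 documents x 5 topics
scores = np.array([
    [0.5, 2.0, 0.5, 1.0, 1.0],
    [3.0, 1.0, 1.0, 1.0, 4.0],
    [0.2, 0.2, 0.2, 0.2, 0.2],
])

# row-normalize so each document's topic weights sum to 1
doc_topic_dist = scores / scores.sum(axis=1, keepdims=True)
print(doc_topic_dist.sum(axis=1))     # each row sums to 1

# index of the highest-probability topic for each document
print(doc_topic_dist.argmax(axis=1))  # -> [1 4 0]
```

Note that argmax only needs a consistent ordering within each row, so it gives the same answer on the unnormalized scores; normalization matters only when you want to treat the rows as probabilities.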
Which version of scikit-learn are you using? –
Also, are the results different? –
Thanks Mikhail, v0.18. My goal is to understand whether the transform function can be used to predict topics for a test set. Thanks – valearner