python - sklearn Latent Dirichlet Allocation: transform vs. fit_transform

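I am using sklearn's NMF and LDA submodules to analyze unlabeled text. I have read the documentation, but I am not sure whether the transform functions in these modules (NMF and LDA) do the same thing as the posterior function in R's topicmodels package (see "Predicting LDA topics for new data"). Basically, I am looking for a function that lets me predict the topics of a test set using a model trained on the training set. First I predicted topics for the entire dataset. Then I split the data into training and test sets, trained a model on the training set, and used that model to transform the test set. I did not expect to get identical results, but comparing the topics from the two runs does not convince me that transform does the same job as the R package's function. I would appreciate your response.

Thanks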

Source: 2016-11-14, valearner

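Which version of scikit-learn are you using? –

Also, do you mean the results are different? –

Thanks Mikhail, v0.18. My goal is to understand whether the transform function provides the ability to predict topics for the test set. Thanks – valearner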

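Calling transform on a LatentDirichletAllocation model returns an unnormalized document-topic distribution. To get proper probabilities, you can simply normalize the result. Here is an example: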

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import numpy as np

# grab a sample data set
dataset = fetch_20newsgroups(shuffle=True, remove=('headers', 'footers', 'quotes'))
train, test = dataset.data[:100], dataset.data[100:200]

# vectorize the features (the vocabulary is learned on the training set only)
tf_vectorizer = TfidfVectorizer(max_features=25)
X_train = tf_vectorizer.fit_transform(train)

# train the model
# (n_components was named n_topics in scikit-learn 0.18; it was renamed in 0.19)
lda = LatentDirichletAllocation(n_components=5)
lda.fit(X_train)

# predict topics for test data
# unnormalized doc-topic distribution
X_test = tf_vectorizer.transform(test)
doc_topic_dist_unnormalized = np.asarray(lda.transform(X_test))

# normalize each row to sum to 1 (only needed if you want to work with the probabilities)
doc_topic_dist = doc_topic_dist_unnormalized / doc_topic_dist_unnormalized.sum(axis=1, keepdims=True)

To find the top-ranked topic for each document, you can do:

doc_topic_dist.argmax(axis=1)
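For anyone who wants to try the same train/transform/normalize/argmax workflow without downloading 20 Newsgroups, here is a minimal sketch on a toy corpus. The documents, the variable names, and the choice of CountVectorizer are my own assumptions for illustration, and n_components assumes scikit-learn 0.19 or later.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, invented for illustration only.
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]
test_docs = ["my cat chased the dog", "bond markets rallied"]

# Learn the vocabulary on the training documents only, then reuse it.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Assumption: scikit-learn >= 0.19 (the parameter was n_topics in 0.18).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_train)

# Normalize explicitly so each row sums to 1 whatever transform returns.
raw = np.asarray(lda.transform(X_test))
doc_topic = raw / raw.sum(axis=1, keepdims=True)

# Index of the highest-probability topic for each test document.
top_topic = doc_topic.argmax(axis=1)
print(doc_topic.round(3))
print(top_topic)
```

With only four training documents the topics themselves mean very little; the point is just the shape of the workflow.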

Source: 2016-11-16 16:33:22

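Thanks Ryan. I was wondering about the NMF model and LDA: I believe that at least the lda module (not sklearn) produces two matrices, W and H. Could one predict for the test data by first setting X_test = tf_vectorizer.transform(test) and then computing X_test * H^T? – valearner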
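On the X_test * H^T idea, at least for sklearn's NMF: as far as I know, transform keeps the learned H (components_) fixed and solves a non-negative least-squares style problem for the new rows of W, which is not the same as the plain product. A small sketch on random non-negative data (the data, sizes, and names here are hypothetical stand-ins for real document-term matrices):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
# Hypothetical non-negative data standing in for document-term matrices.
X_train = rng.rand(20, 8)
X_test = rng.rand(5, 8)

nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W_train = nmf.fit_transform(X_train)  # training documents x topics
H = nmf.components_                   # topics x terms

# transform() holds H fixed and fits non-negative weights for new rows.
W_test = nmf.transform(X_test)

# The plain product from the comment above is a different quantity.
W_naive = X_test @ H.T

# Compare how well each candidate W reconstructs the test data.
err_transform = np.linalg.norm(X_test - W_test @ H)
err_naive = np.linalg.norm(X_test - W_naive @ H)
print(err_transform, err_naive)
```

Since transform minimizes the reconstruction error over non-negative W, its error should not exceed that of the naive product. This says nothing about the separate lda package, which I have not checked.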
