Python scikit-learn: get the documents for each topic in LDA

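I am running LDA on text data, using the example here. My question is:
How can I find out which documents correspond to which topic? In other words, which documents talk about topic 1, for example?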

Here are my steps:

n_features = 1000 
n_topics = 8 
n_top_words = 20

I read my text file line by line:

with open('dataset.txt', 'r') as data_file:
    # One document per line, with surrounding whitespace stripped
    mydata = [line.strip() for line in data_file]

A function to print the topics:

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("TopiC#%d:" % topic_idx)
        # Take the n_top_words highest-weighted words, in descending order
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

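Vectorize the data:

from sklearn.feature_extraction.text import CountVectorizer

# Keep tokens of at least three word characters; drop English stop words
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, token_pattern='\\b\\w{2,}\\w+\\b',
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(mydata)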

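Initialize the LDA:

from sklearn.decomposition import LatentDirichletAllocation

# Note: scikit-learn 0.19 renamed n_topics to n_components
lda = LatentDirichletAllocation(n_topics=3, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)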

Run LDA on the tf data:

lda.fit(tf)

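Print the results using the function above:

print("\nTopics in LDA model:")
# Note: scikit-learn 1.0 and later provide get_feature_names_out() instead
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)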

The printed output is:

Topics in LDA model: 
TopiC#0: 
solar road body lamp power battery energy beacon 
TopiC#1: 
skin cosmetic hair extract dermatological aging production active 
TopiC#2: 
cosmetic oil water agent block emulsion ingredients mixture

Source

2017-07-17 passion

You need to transform the data:

doc_topic = lda.transform(tf)

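and then list each document together with its highest-scoring topic:

for n in range(doc_topic.shape[0]):
    # Index of the topic with the largest weight for document n
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}\n".format(n, topic_most_pr))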
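
To go the other way and list the documents that belong to each topic, one option is to group the document indices by their highest-scoring topic. This is only a minimal sketch: it assumes `doc_topic` has the shape returned by `lda.transform(tf)` (one row per document, one column per topic), and both the small matrix and the helper name `docs_per_topic` are made up for illustration.

```python
import numpy as np

def docs_per_topic(doc_topic):
    """Map each topic index to the documents whose top topic it is.

    docs_per_topic is a hypothetical helper, not part of scikit-learn.
    """
    doc_topic = np.asarray(doc_topic)
    dominant = doc_topic.argmax(axis=1)  # highest-scoring topic per document
    return {
        topic: np.flatnonzero(dominant == topic).tolist()
        for topic in range(doc_topic.shape[1])
    }

# Made-up document-topic matrix: 4 documents, 3 topics.
doc_topic = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.6, 0.3, 0.1],
])

groups = docs_per_topic(doc_topic)
for topic, docs in groups.items():
    print("topic: {} docs: {}".format(topic, docs))
```

A document is assigned to exactly one topic here, so a document that mixes several topics only shows up under its strongest one.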

Source

2017-07-17 14:56:01 AHC

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation.transform

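The transform method takes the document-word matrix X as input and returns the document-topic distribution for X.

So if you call transform on your documents, you can then look for the documents in which the topic you are interested in makes up a high (enough for your purposes) fraction.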
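
As a rough sketch of that idea, the snippet below keeps the documents whose share of a chosen topic reaches a cutoff. The matrix, the helper name `docs_about_topic`, and the 0.5 threshold are assumptions for illustration; in practice `doc_topic` would be the result of `lda.transform(tf)`. The rows are normalized defensively, in case the values returned do not already sum to 1.

```python
import numpy as np

def docs_about_topic(doc_topic, topic, threshold=0.5):
    """Return indices of documents whose share of `topic` is >= threshold.

    docs_about_topic is a hypothetical helper, not part of scikit-learn.
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    # Turn each row into fractions that sum to 1.
    shares = doc_topic / doc_topic.sum(axis=1, keepdims=True)
    return np.flatnonzero(shares[:, topic] >= threshold).tolist()

# Made-up, unnormalized document-topic weights: 4 documents, 3 topics.
doc_topic = np.array([
    [4.0, 0.5, 0.5],
    [1.0, 3.5, 0.5],
    [0.5, 1.5, 3.0],
    [3.0, 1.5, 0.5],
])

selected = docs_about_topic(doc_topic, topic=0, threshold=0.5)
print("documents about topic 0:", selected)
```

Unlike taking the argmax, a threshold lets one document count for several topics, or for none, depending on where the cutoff is set.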

Source

2017-07-17 14:54:57
