2017-07-17 86 views

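Python scikit-learn: get the documents for each topic in LDA

I am running LDA on text data, using the example here. My question is:
How can I find out which documents correspond to which topic? In other words, which documents talk about topic 1, for example?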

Here are my steps:

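from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_features = 1000
n_topics = 8
n_top_words = 20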

I read my text file line by line:

with open('dataset.txt', 'r') as data_file:
    # One document per line, with surrounding whitespace stripped
    mydata = [line.strip() for line in data_file]

A function to print the topics:

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # argsort is ascending, so the reversed slice yields the
        # n_top_words highest-weighted words, largest first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

Vectorize the data:

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                token_pattern='\\b\\w{2,}\\w+\\b',
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(mydata)

Initialize LDA:

# n_topics was renamed to n_components in scikit-learn 0.19
lda = LatentDirichletAllocation(n_components=3, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

Run LDA on the TF data:

lda.fit(tf) 

Print the results using the function above:

print("\nTopics in LDA model:") 
# scikit-learn >= 1.0; on older versions use get_feature_names()
tf_feature_names = tf_vectorizer.get_feature_names_out()

print_top_words(lda, tf_feature_names, n_top_words) 

The printed output is:

Topics in LDA model:
Topic #0:
solar road body lamp power battery energy beacon
Topic #1:
skin cosmetic hair extract dermatological aging production active
Topic #2:
cosmetic oil water agent block emulsion ingredients mixture

Answer

You need to transform the data:

doc_topic = lda.transform(tf) 

and then list each document with its highest-scoring topic like this:

for n in range(doc_topic.shape[0]): 
    topic_most_pr = doc_topic[n].argmax() 
    print("doc: {} topic: {}\n".format(n,topic_most_pr))
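
The loop above prints one topic per document. To answer the question in the other direction (which documents talk about topic 1?), the same matrix can be grouped by topic. The following is a minimal sketch, not part of the original answer: the helper names `documents_per_topic` and `documents_above_threshold` are made up for illustration, and the small `example_doc_topic` array is invented data standing in for the result of `lda.transform(tf)`. Because LDA treats each document as a mixture of topics, a threshold on the topic weight may suit some uses better than the single highest-scoring topic; the 0.15 cut-off below is an arbitrary assumption.

```python
import numpy as np


def documents_per_topic(doc_topic):
    """Map each topic index to the documents whose highest-scoring topic it is."""
    doc_topic = np.asarray(doc_topic)
    dominant = doc_topic.argmax(axis=1)  # best topic for every document
    groups = {k: [] for k in range(doc_topic.shape[1])}
    for doc_idx, topic_idx in enumerate(dominant):
        groups[int(topic_idx)].append(doc_idx)
    return groups


def documents_above_threshold(doc_topic, topic_idx, threshold):
    """Return the documents whose weight for topic_idx is at least threshold."""
    doc_topic = np.asarray(doc_topic)
    return np.flatnonzero(doc_topic[:, topic_idx] >= threshold).tolist()


# Invented 4-document, 3-topic matrix; in the real pipeline this would be
# doc_topic = lda.transform(tf)
example_doc_topic = np.array([
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.20, 0.70],
])

groups = documents_per_topic(example_doc_topic)
print(groups[1])  # [1, 2]: documents whose dominant topic is topic 1

# Arbitrary cut-off: also picks up document 3, where topic 1 is secondary
print(documents_above_threshold(example_doc_topic, 1, 0.15))  # [1, 2, 3]
```

The indices returned are row numbers of `doc_topic`, so they line up with the positions in `mydata` as long as the same list was passed to the vectorizer.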