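Python scikit-learn: getting the documents for each topic in LDA
I am running LDA on my text data, following the example here: My question is:
How can I find out which documents correspond to which topic? In other words, which documents talk about topic 1, for example?
Here are my steps: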
n_features = 1000
n_topics = 8
n_top_words = 20
I read my text file line by line:
with open('dataset.txt', 'r') as data_file:
    input_lines = [line.strip() for line in data_file.readlines()]
mydata = [line for line in input_lines]
A function to print the topics:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("TopiC#%d:" % topic_idx)
        # argsort() is ascending, so this slice picks the n_top_words largest weights
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
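To make the slicing idiom in print_top_words concrete, here is a minimal sketch using made-up weights for a single topic. The vocabulary and the numbers are purely illustrative assumptions, not values from my model:

```python
import numpy as np

# Hypothetical weights of one topic over a 5-word vocabulary (illustrative only).
feature_names = ["solar", "skin", "oil", "lamp", "hair"]
topic = np.array([0.9, 0.1, 0.3, 0.7, 0.2])
n_top_words = 2

# argsort() returns indices ordered by ascending weight; the slice
# [:-n_top_words - 1:-1] walks backwards from the end and keeps the
# last n_top_words indices, i.e. those with the largest weights.
top_indices = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]
print(top_words)
```

The words come out ordered from the highest weight downwards, which is why each topic line starts with its most characteristic word.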
Vectorizing the data:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, token_pattern='\\b\\w{2,}\\w+\\b',
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(mydata)
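As a quick check of what the token_pattern above keeps, the following sketch applies the same regular expression with the standard re module to a made-up sentence (the sentence is an assumption for illustration). The pattern needs at least two word characters followed by one more, so only tokens of three or more characters survive:

```python
import re

# Same pattern as passed to CountVectorizer, written as a raw string.
token_pattern = r'\b\w{2,}\w+\b'

# Made-up sample text; CountVectorizer would also lowercase it by default.
sample = "an oil in water emulsion"
tokens = re.findall(token_pattern, sample)
print(tokens)
```

Here the two-letter words "an" and "in" are dropped before stop-word filtering even runs.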
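Initializing LDA:
from sklearn.decomposition import LatentDirichletAllocation

# Note: n_topics was renamed to n_components in scikit-learn 0.19.
lda = LatentDirichletAllocation(n_topics=3, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)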
Running LDA on the TF data:
lda.fit(tf)
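Printing the results with the function above:
print("\nTopics in LDA model:")
# Newer scikit-learn releases use get_feature_names_out() instead.
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)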
The printed output is:
Topics in LDA model:
TopiC#0:
solar road body lamp power battery energy beacon
TopiC#1:
skin cosmetic hair extract dermatological aging production active
TopiC#2:
cosmetic oil water agent block emulsion ingredients mixture
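For reference, one direction that might address the question is the document-topic matrix that LatentDirichletAllocation returns from transform (or fit_transform): one row per document, one column per topic. The sketch below is only an illustration under assumptions: the four-document corpus is made up, it uses the newer n_components parameter name rather than n_topics, and taking the argmax of each row as "the" topic of a document is a simplification, since a document can mix several topics.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus standing in for the lines of dataset.txt.
docs = [
    "solar lamp battery power energy",
    "solar road lamp battery beacon",
    "skin hair cosmetic extract aging",
    "cosmetic skin hair oil emulsion",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# n_components is the current name of the parameter called n_topics above.
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# One row per document, one column per topic; each row is a topic distribution.
doc_topic = lda.fit_transform(counts)

# Index of the highest-weighted topic for every document.
dominant = doc_topic.argmax(axis=1)

# Group document indices by their dominant topic.
docs_by_topic = {
    k: np.where(dominant == k)[0].tolist()
    for k in range(doc_topic.shape[1])
}
for k, idx in docs_by_topic.items():
    print("Topic #%d: documents %s" % (k, idx))
```

With my own data, the equivalent call would presumably be lda.transform(tf), and the row indices would map back to positions in mydata. A threshold on the row values, instead of argmax, could be used if a document should be allowed to belong to more than one topic.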