2017-07-25

How do I get the full topic distribution of a document with gensim LDA? I train my LDA model like this:

import multiprocessing
from gensim import corpora
from gensim.models import LdaMulticore

dictionary = corpora.Dictionary(data)
corpus = [dictionary.doc2bow(doc) for doc in data]
num_cores = multiprocessing.cpu_count()
num_topics = 50
lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1)

I want the full topic distribution over all num_topics for every document. That is, in this particular case I want each document to have a distribution over all 50 topics, and I want to be able to access every topic's contribution. This is what LDA should output if one adheres strictly to the mathematics of LDA. However, gensim only outputs topics that exceed a certain threshold, as shown here. For example, if I try

lda[corpus[89]] 
>>> [(2, 0.38951721864890398), (9, 0.15438596408262636), (37, 0.45607443684895665)] 

this shows only the 3 topics that contribute most to document 89. I have tried the solution in the link above, but it did not work for me. I still get the same output:

theta, _ = lda.inference(corpus) 
theta /= theta.sum(axis=1)[:, None] 

produces the same output, i.e. only 2 or 3 topics per document.

My question is: how do I change this threshold so I can access the FULL topic distribution for each document, no matter how insignificantly a topic contributes to it? The reason I want the full distribution is so that I can perform a KL-divergence similarity search between document distributions.
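For reference, once the distributions are dense, the KL divergence itself needs nothing beyond NumPy; a minimal sketch (the doc_a/doc_b vectors are made-up 4-topic illustrations, not real model output):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete topic distributions of equal length.

    A small eps is added before normalizing so that zero entries do not
    produce log(0); for strictly positive gensim outputs this barely
    changes the result.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical dense topic distributions for two documents (num_topics = 4):
doc_a = [0.70, 0.10, 0.10, 0.10]
doc_b = [0.10, 0.70, 0.10, 0.10]
print(kl_divergence(doc_a, doc_a))  # ~0.0: identical distributions
print(kl_divergence(doc_a, doc_b))  # > 0: distributions differ
```

scipy.stats.entropy(p, q) computes the same quantity if SciPy is available.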

Thanks in advance

Answer


Since no one else seems to have answered, I'll try to answer this as best I can from the gensim documentation.

It seems you need to set the parameter minimum_probability to 0.0 when training the model in order to get the desired result:

lda = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1,
                   minimum_probability=0.0)

lda[corpus[233]] 
>>> [(0, 5.8821799358842424e-07), 
(1, 5.8821799358842424e-07), 
(2, 5.8821799358842424e-07), 
(3, 5.8821799358842424e-07), 
(4, 5.8821799358842424e-07), 
(5, 5.8821799358842424e-07), 
(6, 5.8821799358842424e-07), 
(7, 5.8821799358842424e-07), 
(8, 5.8821799358842424e-07), 
(9, 5.8821799358842424e-07), 
(10, 5.8821799358842424e-07), 
(11, 5.8821799358842424e-07), 
(12, 5.8821799358842424e-07), 
(13, 5.8821799358842424e-07), 
(14, 5.8821799358842424e-07), 
(15, 5.8821799358842424e-07), 
(16, 5.8821799358842424e-07), 
(17, 5.8821799358842424e-07), 
(18, 5.8821799358842424e-07), 
(19, 5.8821799358842424e-07), 
(20, 5.8821799358842424e-07), 
(21, 5.8821799358842424e-07), 
(22, 5.8821799358842424e-07), 
(23, 5.8821799358842424e-07), 
(24, 5.8821799358842424e-07), 
(25, 5.8821799358842424e-07), 
(26, 5.8821799358842424e-07), 
(27, 0.99997117731831464), 
(28, 5.8821799358842424e-07), 
(29, 5.8821799358842424e-07), 
(30, 5.8821799358842424e-07), 
(31, 5.8821799358842424e-07), 
(32, 5.8821799358842424e-07), 
(33, 5.8821799358842424e-07), 
(34, 5.8821799358842424e-07), 
(35, 5.8821799358842424e-07), 
(36, 5.8821799358842424e-07), 
(37, 5.8821799358842424e-07), 
(38, 5.8821799358842424e-07), 
(39, 5.8821799358842424e-07), 
(40, 5.8821799358842424e-07), 
(41, 5.8821799358842424e-07), 
(42, 5.8821799358842424e-07), 
(43, 5.8821799358842424e-07), 
(44, 5.8821799358842424e-07), 
(45, 5.8821799358842424e-07), 
(46, 5.8821799358842424e-07), 
(47, 5.8821799358842424e-07), 
(48, 5.8821799358842424e-07), 
(49, 5.8821799358842424e-07)]
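With minimum_probability=0.0 every document yields a full-length list of (topic_id, probability) pairs, which converts cleanly to a dense vector for the similarity search; gensim's own gensim.matutils.sparse2full(doc, num_topics) does the same job. A minimal sketch with a made-up 5-topic output:

```python
import numpy as np

def to_dense(topic_list, num_topics):
    """Turn gensim's [(topic_id, prob), ...] output into a dense vector."""
    vec = np.zeros(num_topics)
    for topic_id, prob in topic_list:
        vec[topic_id] = prob
    return vec

# Hypothetical output of lda[corpus[i]] for a 5-topic model:
doc = [(0, 1e-6), (2, 0.9), (4, 0.099999)]
dense = to_dense(doc, 5)
print(dense.shape)  # (5,)
```

Stacking these vectors row-wise gives a documents-by-topics matrix on which pairwise KL divergences can be computed directly.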