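Evaluating an STM model

I'm working with STM (structural topic models, i.e. topic modelling) and I would like to evaluate and validate the model, but I'm not sure how to do that. My code is: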

Corpus.STM <- readCorpus(dtm, type = "slam")

Model selection:

BestM1. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(10,20, 30, 40, 50, 60), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land) 
BestM2. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(85,110), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land) 
BestM3. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(20,21,22,23,24,25,26,27,28,29,30), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land) 

str(BestM1.)
# plot() dispatches to plot.searchK() for searchK objects
plot(BestM1.)
plot(BestM2.)
plot(BestM3.)
#27 seems to be a good choice 
#Heldout 
set.seed(1) 
heldout<- make.heldout(Corpus.STM$documents, Corpus.STM$vocab, proportion = .5,seed = 1) 
stm.mod1 <- stm(heldout$documents, heldout$vocab, K =27, seed = 1, init.type = "Spectral", max.em.its = 100) 
heldout.evaluation <- eval.heldout(stm.mod1, heldout$missing) 
heldout.evaluation 
#evaluation heldout 
labelTopics(stm.mod1) 
# plot() dispatches to plot.STM(); its FREX weight argument is frexw and only matters with labeltype="frex"
plot(stm.mod1, type="labels", n=5, frexw = 0.25)
cloud(stm.mod1, topic=5)
plot(stm.mod1, type="summary", labeltype="frex", topics=c(1:5), n=8)

I'm not sure how to interpret the output of eval.heldout. I also want to make sure the model does not overfit, but I'm not sure how to check that.

Source

2017-01-02 S.Weigel

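eval.heldout() calculates the held-out log-likelihood using document completion. The number you want is heldout.evaluation$expected.heldout, which is the average of the per-document held-out log-likelihood. Unfortunately, there is no clear standard for deciding whether the model has "overfit". The plot.searchK() call you have will give you a plot of the held-out log-likelihood for the different values of K, and certainly if that number goes down as K increases, one interpretation is overfitting.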
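To make the pieces of that output concrete, here is a minimal sketch that follows the pattern of the stm documentation example. It uses the poliblog5k data bundled with the package rather than the asker's corpus, and the subsample size, thresholds, K = 5, random initialisation and iteration cap are arbitrary choices made only to keep it fast.

```r
library(stm)

# Bundled example data: poliblog5k.docs, poliblog5k.voc, poliblog5k.meta
data(poliblog5k)

# Subsample to keep the sketch fast (thresholds are arbitrary assumptions)
set.seed(1)
out <- prepDocuments(poliblog5k.docs, poliblog5k.voc, poliblog5k.meta,
                     subsample = 500, lower.thresh = 20, upper.thresh = 200)

# Hold out a share of the words in a subset of the documents
heldout <- make.heldout(out$documents, out$vocab, seed = 1)

# Fit on the partially observed documents only
mod <- stm(heldout$documents, heldout$vocab, K = 5,
           init.type = "Random", seed = 1,
           max.em.its = 5, verbose = FALSE)

# Score the held-out words under the fitted model
ev <- eval.heldout(mod, heldout$missing)

ev$expected.heldout       # single summary number (closer to zero is better)
summary(ev$doc.heldout)   # per-document values behind that average
```

The value is only meaningful relative to other models evaluated on the same held-out words, so it is best read as a comparison between candidate fits rather than as an absolute score.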

Sorry not to have a more definitive answer, but sadly there are no hard and fast rules here.
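In the absence of a rule, one informal way to read the search output is to tabulate the held-out likelihood by K and see where it stops improving. The sketch below uses a hypothetical helper, heldout_by_k(), which is not part of stm. It assumes the searchK object stores its diagnostics in $results with K and heldout columns (possibly as list columns, hence the unlist()). The numbers in the toy table are invented purely for illustration.

```r
# Hypothetical helper (not part of stm): tabulate held-out likelihood by K
# and report the K at which it peaks.
heldout_by_k <- function(results) {
  tab <- data.frame(K = unlist(results$K),
                    heldout = unlist(results$heldout))
  tab <- tab[order(tab$K), ]
  # Change relative to the previous K; a negative value means it got worse
  tab$change <- c(NA, diff(tab$heldout))
  list(table = tab, best_K = tab$K[which.max(tab$heldout)])
}

# Invented numbers, for illustration only (not from any real model)
toy <- data.frame(K = c(10, 20, 30, 40),
                  heldout = c(-7.9, -7.6, -7.5, -7.7))
res <- heldout_by_k(toy)
res$table
res$best_K

# With a real search one would pass the stored results instead, e.g.
# heldout_by_k(BestM1.$results)
```

A drop after the peak is consistent with overfitting but does not prove it, so it is worth weighing this alongside the other diagnostics in the searchK plot, such as semantic coherence and residuals.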

Source

2018-01-08 18:24:15 bstewart
