-2
我想將所有文檔劃分爲10個主題,除了主題的分佈維數和協方差矩陣之外,它與聚合結果一致。
爲什麼主題分佈是9維矢量而不是10,它們的協方差矩陣是9 * 9矩陣而不是10 * 10?主題分佈的不同維度
我使用library(topicmodels)
和函數CTM()
來實現中文主題模型。
我的代碼如下:
library(rJava);
library(Rwordseg);
library(NLP);
library(tm);
library(tmcn)
library(tm)
library(Rwordseg)
library(topicmodels)
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Law.scel","Law");
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\NationalInstitution.scel","NationalInstitution");
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Place.scel","Place");
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Psychology.scel","Psychology");
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Politics.scel","Politics");
listDict();
#read file
d.vec <- segmentCN("samgovWithoutID.csv", returnType = "tm")
samgov.segment <- read.table("samgovWithoutID.segment.csv", header = TRUE, fill = TRUE, stringsAsFactors = F, sep = ",",fileEncoding='utf-8')
fix(samgov.segment)
# create DTM(document term matrix)
d.corpus <- Corpus(VectorSource(samgov.segment$content))
inspect(d.corpus[1:10])
d.corpus <- tm_map(d.corpus, removeWords, stopwordsCN())
ctrl <- list(removePunctuation = TRUE, removeNumbers= TRUE, wordLengths = c(1, Inf), stopwords = stopwordsCN(), wordLengths = c(2, Inf))
d.dtm <- DocumentTermMatrix(d.corpus, control = ctrl)
inspect(d.dtm[1:10, 110:112])
# impletment topic models
ctm10<-CTM(d.dtm,k=10, control=list(seed=2014012692))
Terms10 <- terms(ctm10, 10)
Terms10[,1:10]
ctm20<-CTM(d.dtm,k=20, control=list(seed=2014012692))
Terms20 <- terms(ctm20, 20)
Terms20[,1:20]
結果中的R工作室(見突出顯示的部分):
幫助文檔:
請提供[可重現的示例](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)。 – figurine
Thx爲您的評論! – Jeffy