2013-05-06

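R topic modeling: lda model labeling function

I am using LDA to build a topic model for two text documents, called A and B. Document A is highly related to computer science and document B is highly related to geoscience. I then trained an LDA model with this code: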

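    library(tm)           # Corpus, tm_map, DocumentTermMatrix
    library(topicmodels)  # LDA, posterior

    text <- c(A, B)  # introduced above
    r <- Corpus(VectorSource(text))  # create corpus object
    r <- tm_map(r, content_transformer(tolower))  # convert all text to lower case
    r <- tm_map(r, removePunctuation)
    r <- tm_map(r, removeNumbers)
    r <- tm_map(r, removeWords, stopwords("english"))
    # LDA() expects documents in rows, so build a DocumentTermMatrix rather than
    # a TermDocumentMatrix; wordLengths replaces the older minWordLength option
    r.dtm <- DocumentTermMatrix(r, control = list(wordLengths = c(3, Inf)))
    my_lda <- LDA(r.dtm, 2)  # fit a 2-topic model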

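Now I want to use my_lda to predict the topic of a new document, say C, and see whether it relates to computer science or geoscience. I know I can use this code for prediction: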

    x <- C  # a new document (a long string) introduced above for prediction
    rp <- Corpus(VectorSource(x))  # create corpus object
    rp <- tm_map(rp, content_transformer(tolower))  # convert all text to lower case
    rp <- tm_map(rp, removePunctuation)
    rp <- tm_map(rp, removeNumbers)
    rp <- tm_map(rp, removeWords, stopwords("english"))
    # documents in rows, same as the matrix the model was trained on
    rp.dtm <- DocumentTermMatrix(rp, control = list(wordLengths = c(3, Inf)))
    test.topics <- posterior(my_lda, rp.dtm)  # list with $terms and $topics

It will give me a label of 1 or 2, and I have no idea what 1 or 2 represents. How can I tell whether it means "computer science related" or "geoscience related"?

What package are you using? – Carson 2013-05-06 14:47:12

tm and topicmodels – 2013-05-06 19:44:07

Answer

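You can extract the most likely terms from your LDA topic model and replace those black-box numeric names with however many of them you wish. Your example isn't reproducible, but here is an example illustrating how you can do this: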

> library(topicmodels) 
> data(AssociatedPress) 
> 
> train <- AssociatedPress[1:100] 
> test <- AssociatedPress[101:150] 
> 
> train.lda <- LDA(train,2) 
> 
> #returns those black box names 
> test.topics <- posterior(train.lda,test)$topics 
> head(test.topics) 
       1   2 
[1,] 0.57245696 0.427543038 
[2,] 0.56281568 0.437184320 
[3,] 0.99486888 0.005131122 
[4,] 0.45298547 0.547014530 
[5,] 0.72006712 0.279932882 
[6,] 0.03164725 0.968352746 
> #extract top 5 terms for each topic and assign as variable names 
> colnames(test.topics) <- apply(terms(train.lda,5),2,paste,collapse=",") 
> head(test.topics) 
     percent,year,i,new,last new,people,i,soviet,states
[1,]              0.57245696                0.427543038
[2,]              0.56281568                0.437184320
[3,]              0.99486888                0.005131122
[4,]              0.45298547                0.547014530
[5,]              0.72006712                0.279932882
[6,]              0.03164725                0.968352746
> #round to one topic if you'd prefer 
> test.topics <- apply(test.topics,1,function(x) colnames(test.topics)[which.max(x)]) 
> head(test.topics) 
[1] "percent,year,i,new,last" "percent,year,i,new,last" "percent,year,i,new,last" 
[4] "new,people,i,soviet,states" "percent,year,i,new,last" "new,people,i,soviet,states"
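
Since the question already knows what the two training documents are about, another option is to name each topic after the training document that loads on it most heavily. The following is only a minimal sketch in base R under stated assumptions: `label_topics` is a hypothetical helper (not part of topicmodels), and the two posterior matrices are made-up numbers standing in for `posterior(my_lda)$topics` (training documents A and B) and `posterior(my_lda, rp.dtm)$topics` (new document C).

```r
# Hypothetical helper: name each topic after the known training document
# that has the highest weight on it.
# train_topics: documents x topics matrix, e.g. posterior(my_lda)$topics
# doc_labels:   one human-readable label per training document (row)
label_topics <- function(train_topics, doc_labels) {
  stopifnot(nrow(train_topics) == length(doc_labels))
  # for each topic (column), find the row with the largest probability
  best_doc <- apply(train_topics, 2, which.max)
  doc_labels[best_doc]
}

# Made-up posterior for the training documents A and B (illustration only)
train_topics <- matrix(c(0.10, 0.90,
                         0.85, 0.15),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(c("A", "B"), c("1", "2")))
topic_names <- label_topics(train_topics, c("computer science", "geoscience"))

# Made-up posterior for the new document C (illustration only)
new_topics <- matrix(c(0.20, 0.80), nrow = 1,
                     dimnames = list("C", c("1", "2")))
colnames(new_topics) <- topic_names

# Pick the most probable topic for each new document
predicted <- colnames(new_topics)[apply(new_topics, 1, which.max)]
print(predicted)
```

With only two training documents there is no guarantee that each one dominates a different topic; if both load on the same topic, the helper returns duplicate names, and inspecting the top terms as shown above is the safer check.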