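I have a dataset of 260 RTI applications and need to run LDA on them. I created the term-document matrix with both the tm and RTextTools packages, but the outputs differ considerably. The tm result reports no sparse entries at all, and the total number of terms is also very different. Why do the tm and RTextTools packages produce different output? Here is the code: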
library("tm")
library("RTextTools")
# (the data is read here into a variable called 'data')
doc = Corpus(VectorSource(data))
m = create_matrix(data, language = "english", removeNumbers = TRUE, removePunctuation = TRUE, stemWords = TRUE, weighting = weightTf) #RtextTools statement
tdm <- TermDocumentMatrix(doc, control = list(removePunctuation = TRUE, removeNumbers = TRUE, language = "english", stemWords = TRUE, stopWords = TRUE, weighting = weightTf)) #tm statement
>m
#<<DocumentTermMatrix (documents: 260, terms: 951)>>
Non-/sparse entries: 2669/244591
Sparsity : 99%
>tdm
#<<TermDocumentMatrix (terms: 1024, documents: 1)>>
Non-/sparse entries: 1024/0
Sparsity : 0%
If you need the dataset to understand the problem better, let me know.
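As a quick sanity check on the two summaries printed above, the cell counts can be recomputed by hand. This is only arithmetic on the numbers shown in the output (the variable names are made up for illustration). It suggests that tm reports no sparse entries simply because its matrix has a single document column, so every term in it necessarily has a non-zero count.

```r
# Numbers copied from the two printed summaries above.
cells_m   <- 260 * 951    # documents x terms in the RTextTools matrix
cells_tdm <- 1024 * 1     # terms x documents in the tm matrix

# Sparse entries = total cells minus non-zero cells.
sparse_m   <- cells_m - 2669
sparse_tdm <- cells_tdm - 1024

# Sparsity = share of cells that are zero.
sparsity_m   <- sparse_m / cells_m
sparsity_tdm <- sparse_tdm / cells_tdm

print(c(sparse_m = sparse_m, sparse_tdm = sparse_tdm))
print(round(100 * c(sparsity_m, sparsity_tdm)))
```

With only one document, a term-document matrix cannot contain zeros, so the thing to explain is why tm sees 1 document instead of 260.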
So you suggest using VCorpus? – BlackSwan
@HimabinduBoddupalli Yes. – lukeA
doc = VCorpus(VectorSource(data)) tdm <- TermDocumentMatrix(doc, control = list(language = "english", removeNumbers = TRUE, removePuncutation = TRUE, stemming = TRUE, stopWords = TRUE, weighting = weightTf)) Still doesn't work. – BlackSwan
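One thing that may be worth checking, since the thread never shows how 'data' was read in: if 'data' is a data frame rather than a character vector, its length is the number of columns, which would be consistent with tm reporting "documents: 1". The sketch below is not taken from the thread. It assumes a hypothetical data frame with a single column named 'text' and three made-up documents, and it uses the control option names documented for tm's termFreq (lowercase stopwords, and removePunctuation spelled out). stemming = TRUE is left out here only because it needs the SnowballC package to be installed.

```r
library(tm)

# Hypothetical stand-in for the real data: a data frame with one text column.
data <- data.frame(
  text = c("First RTI application about road repairs.",
           "Second application asking for pension records.",
           "Third request concerning school funding in 2014."),
  stringsAsFactors = FALSE
)

length(data)        # number of columns of the data frame, not number of texts
length(data$text)   # one element per document

# Pass the character vector itself, so each text becomes its own document.
texts <- as.character(data$text)
doc <- VCorpus(VectorSource(texts))

# Option names as documented for tm's termFreq control list.
dtm <- DocumentTermMatrix(doc, control = list(
  removePunctuation = TRUE,
  removeNumbers     = TRUE,
  stopwords         = TRUE,
  weighting         = weightTf
))

nDocs(dtm)   # should equal the number of texts
```

If nDocs() on the real data returns 260 after this change, the remaining difference in term counts between the two packages would come down to their preprocessing settings rather than to how the corpus was built.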