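Why do the tm and RTextTools packages produce different output?

I have a dataset of 260 RTI applications and need to run LDA on them. I created the term-document matrix with both the tm and RTextTools packages, but the outputs differ considerably. The tm result shows no sparse entries at all, and the total number of terms is also quite different. Here is the code: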

library("tm") 
library("RTextTools") 
<I read the data here into a variable called 'data'> 
doc = Corpus(VectorSource(data)) 
m = create_matrix(data, language = "english", removeNumbers = TRUE, removePunctuation = TRUE, stemWords = TRUE, weighting = weightTf) #RtextTools statement 
tdm <- TermDocumentMatrix(doc, control = list(removePunctuation = TRUE, removeNumbers = TRUE, language = "english", stemWords = TRUE, stopWords = TRUE, weighting = weightTf)) #tm statement 
>m 
#<<DocumentTermMatrix (documents: 260, terms: 951)>> 
Non-/sparse entries: 2669/244591 
Sparsity   : 99% 
>tdm 
#<<TermDocumentMatrix (terms: 1024, documents: 1)>> 
Non-/sparse entries: 1024/0 
Sparsity   : 0%

If you need the dataset to understand the problem better, let me know.


2017-07-06 BlackSwan

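See ?termFreq: the control options must be stemming = TRUE, stopwords = TRUE, not stemWords = TRUE, stopWords = TRUE. Also note that a SimpleCorpus object triggers a default behaviour in TermDocumentMatrix that may override your control parameters.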
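A minimal sketch of the corrected call, assuming the SnowballC package is installed (tm needs it for stemming). The three sample strings are hypothetical stand-ins for the real `data` vector, which the question does not show.

```r
library(tm)

# Hypothetical sample texts standing in for the 260 applications in `data`
data <- c("The applicant requests copies of the records.",
          "Requesting information about 25 pending applications.",
          "Please provide the requested record.")

# VCorpus instead of Corpus(), so that the control options are passed to termFreq
doc <- VCorpus(VectorSource(data))

# Option names as documented in ?termFreq: `stopwords` and `stemming`
tdm <- TermDocumentMatrix(doc, control = list(
  removePunctuation = TRUE,
  removeNumbers     = TRUE,
  stopwords         = TRUE,   # not stopWords
  stemming          = TRUE,   # not stemWords
  weighting         = weightTf
))

# Print the summary; it should now report one column per document
tdm
```

If the options are being applied, stop words such as "the" should be absent from `Terms(tdm)` and inflected forms should collapse to a common stem.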


2017-07-06 12:42:43 lukeA

So you suggest using VCorpus? – BlackSwan

@HimabinduBoddupalli Yes. – lukeA

doc = VCorpus(VectorSource(data)) tdm <- TermDocumentMatrix(doc, control = list(language = "english", removeNumbers = TRUE, removePuncutation = TRUE, stemming = TRUE, stopWords = TRUE, weighting = weightTf)) Still doesn't work. – BlackSwan
