2017-10-19 116 views
2

我剛剛在R中開始使用tm包,似乎無法解決問題。 雖然我的分詞器的功能似乎工作權:R中的TermDocumentMatrix - 僅創建1剋剋

uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1)) 
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2)) 
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3)) 

uniTDM <- TermDocumentMatrix(corpus, control=list(tokenize = uniTokenizer)) 
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer)) 
triTDM <- TermDocumentMatrix(corpus, control=list(tokenize = triTokenizer)) 

當我試圖拉2克從biTDM,只有1克拿出...

findFreqTerms(biTDM, 50) 

[1] "after" "and"  "most" "the"  "were" "years" "love" 
[8] "you"  "all"  "also" "been" "did"  "from" "get"  

的同時, 2克的功能似乎是在機智:

x <- biTokenizer(corpus) 
head(x) 

[1] "c in"    "in the"   "the years"  
[4] "years thereafter" "thereafter most" "most of"  
+2

包括[最小再現的示例](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)在你的問題會增加你的機會得到答案。 – jsb

回答

0

我只能假設是什麼問題在這裏:NGramTokenizer需要一個VCorpus對象,而不是Corpus物體。

library(tm) 
library(RWeka) 

# some dummy text 
text <- c("Lorem ipsum dolor sit amet, consetetur sadipscing elitr", 
      "sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat", 
      "sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum", 
      "Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet") 

# create a VCorpus 
corpus <- VCorpus(VectorSource(text)) 


biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2)) 


biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer)) 

print(biTDM$dimnames$Terms) 

[1] "accusam et"   "aliquyam erat"   "amet consetetur"  "at vero"    "clita kasd"   "consetetur sadipscing" "diam nonumy"   "diam voluptua"   "dolor sit"    "dolore magna"   
[11] "dolores et"   "duo dolores"   "ea rebum"    "eirmod tempor"   "eos et"    "est lorem"    "et accusam"   "et dolore"    "et ea"     "et justo"    
[21] "gubergren no"   "invidunt ut"   "ipsum dolor"   "justo duo"    "kasd gubergren"  "labore et"    "lorem ipsum"   "magna aliquyam"  "no sea"    "nonumy eirmod"   
[31] "sadipscing elitr"  "sanctus est"   "sea takimata"   "sed diam"    "sit amet"    "stet clita"   "takimata sanctus"  "tempor invidunt"  "ut labore"    "vero eos"    
[41] "voluptua at"