DocumentTermMatrix in the tm package does not return all words

I am building a document-term matrix with the tm package in R, but some words from my corpus get lost somewhere along the way.

I will explain with an example. Say I have this small corpus:

library(tm)
crps <- " more hours to my next class bout to go home and go night night"
crps <- VCorpus(VectorSource(crps))

When I use DocumentTermMatrix() from the tm package, it returns these results:

dm <- DocumentTermMatrix(crps)
dm_matrix <- as.matrix(dm)
dm_matrix
#     Terms
# Docs and bout class home hours more next night
#    1   1    1     1    1     1    1    1     2

However, what I want (and expected) is:

# Docs and bout class home hours more next night my go to
#    1   1    1     1    1     1    1    1     2  1  2  2

Why does DocumentTermMatrix() skip the words "my", "go" and "to"? Is there a way to control and fix this behavior?


2017-10-09 Fouad Selmane

I assume you are using the 'tm' package? What kind of object is 'crps', and how did you get it? Did you use something like 'crps <- Corpus(VectorSource(some_text_string))'?

Yes, I used 'crps <- VCorpus(VectorSource(My_text))'.

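By default, DocumentTermMatrix() discards words shorter than three characters. That is why the words "to", "my" and "go" are not counted when the document-term matrix is built.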
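To see which words fall under that length limit, here is a minimal base-R sketch. It splits the sample sentence on whitespace, which only approximates how tm tokenizes text, and the names `txt`, `tokens` and `short_tokens` are made up for this illustration.

```r
# The sample sentence from the question.
txt <- " more hours to my next class bout to go home and go night night"

# Split on runs of whitespace (an approximation of tm's tokenizer).
tokens <- strsplit(trimws(txt), "\\s+")[[1]]

# Keep each distinct token that has fewer than three characters.
short_tokens <- unique(tokens[nchar(tokens) < 3])
print(short_tokens)
```

The tokens it prints are exactly the three words missing from the matrix; "and" has three characters, so it is kept.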

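On the help page ?DocumentTermMatrix you can see that there is an optional argument, control. It accepts a number of options that have default values (see the help page ?termFreq for details). One of those defaults is a minimum word length of three characters, i.e. wordLengths = c(3, Inf). You can change this to keep all words regardless of their length: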

# crps is the corpus from the question; a lower bound of 1 keeps short words
dm <- DocumentTermMatrix(crps, control = list(wordLengths = c(1, Inf)))

inspect(dm) 
# <<DocumentTermMatrix (documents: 1, terms: 11)>> 
# Non-/sparse entries: 11/0 
# Sparsity   : 0% 
# Maximal term length: 5 
# Weighting   : term frequency (tf) 
# 
#     Terms
# Docs and bout class go home hours more my next night to
#    1   1    1     1  2    1     1    1  1    1     2  2
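One way to check what the default setting removed is to build both matrices and compare their term lists. This is only a sketch, assuming the tm package is installed; `txt`, `dm_default`, `dm_all` and `dropped` are names chosen for this illustration.

```r
library(tm)

# Rebuild the one-document corpus from the question.
txt  <- " more hours to my next class bout to go home and go night night"
crps <- VCorpus(VectorSource(txt))

# Default control: wordLengths = c(3, Inf), so short words are discarded.
dm_default <- DocumentTermMatrix(crps)

# Custom control: keep every word of one character or more.
dm_all <- DocumentTermMatrix(crps, control = list(wordLengths = c(1, Inf)))

# Terms present only when the length limit is lowered.
dropped <- setdiff(Terms(dm_all), Terms(dm_default))
print(dropped)
```

The same wordLengths option also takes an upper bound, so a finite second value would restrict long words in the same way.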


2017-10-09 11:40:06
