Keeping punctuation in a document-term matrix in R

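I am trying to use DocumentTermMatrix in R with the control = list() argument to restrict the terms to a predefined list of text-based emoticons (:D, :), :(, and so on). However, the dtm does not pick up some emoticons (such as ":D" or ":)"), while others work fine (":))"). My code: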

library(tm)
text = c(":D", ":))")
corpus <- Corpus(VectorSource(text))
corpus = tm_map(corpus, PlainTextDocument)
dtm = DocumentTermMatrix(corpus, list(dictionary = c(":D", ":))")))
emojidf <- as.data.frame(as.matrix(dtm))

  :D :))
1  0   0
2  0   1

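As a workaround, I can use content_transformer and gsub to replace the problematic emoticons with words. However, I would like to know how to make DocumentTermMatrix, or even Corpus, treat punctuation as words.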
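A minimal sketch of that workaround, assuming the tm package is installed. The placeholder words ("emobigsmile", "emogrin") and the helper name replace_fixed are made up for illustration and are not part of tm. Longer emoticons are listed first so that a shorter one such as ":)" could not match inside ":))".

```r
library(tm)

text <- c(":D", ":))")
corpus <- VCorpus(VectorSource(text))

# Hypothetical mapping from emoticon to a plain-word placeholder.
# Placeholders are lowercase and at least 3 characters long, so they
# survive the default tolower and wordLengths settings.
emoticon_words <- c(":))" = "emobigsmile", ":D" = "emogrin")

# Wrap gsub so it operates on document content; fixed = TRUE treats the
# emoticon as a literal string instead of a regular expression.
replace_fixed <- content_transformer(function(x, pattern, replacement) {
  gsub(pattern, replacement, x, fixed = TRUE)
})

for (emoticon in names(emoticon_words)) {
  corpus <- tm_map(corpus, replace_fixed,
                   pattern = emoticon,
                   replacement = emoticon_words[[emoticon]])
}

dtm <- DocumentTermMatrix(
  corpus,
  control = list(dictionary = unname(emoticon_words))
)
m <- as.matrix(dtm)
print(m)
```

This counts the placeholders rather than the emoticons themselves, so the column names have to be mapped back afterwards if the original symbols are needed.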


2017-04-21 huydinh282

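There are two issues (see ?DocumentTermMatrix and ?termFreq): by default, the wordLengths filter requires a minimum word length of 3 characters, and the default tolower setting converts :D to :d. So try: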

library(tm)
text <- c(":D", ":))")
corpus <- Corpus(VectorSource(text))
dtm <- DocumentTermMatrix(
  corpus,
  control = list(
    dictionary = c(":D", ":))"),
    wordLengths = c(-Inf, Inf),
    tolower = FALSE
  )
)
as.matrix(dtm)
#     Terms
# Docs :)) :D
#    1   0  1
#    2   1  0
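
The effect of the two defaults can be checked in base R without tm. This is only a sketch of the two conditions, assuming the documented defaults wordLengths = c(3, Inf) and tolower = TRUE; it does not reproduce tm's internal processing, and the emoticon vector is just the list from the question.

```r
# Emoticons from the question, used here only for illustration.
emoticons <- c(":D", ":)", ":(", ":))")

# Default wordLengths = c(3, Inf): tokens shorter than 3 characters are dropped.
lengths_ok <- nchar(emoticons) >= 3
names(lengths_ok) <- emoticons
print(lengths_ok)

# Default tolower = TRUE: the token is lowercased, so ":D" becomes ":d"
# and no longer equals the dictionary entry ":D".
lowered <- tolower(":D")
print(lowered)
print(lowered %in% c(":D", ":))"))
```

Only ":))" passes the length check, which matches the behaviour described in the question.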


2017-04-21 09:49:51 lukeA

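Thanks for pointing out the default for 'tolower'! I had figured out the 3-character lower bound, but did not expect tolower to be built into the dtm, since I usually apply it with tm_map before building the dtm – huydinh282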
