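Write a function to find the most common word in a text string, using R

I need to write a function in R that finds the most common word in a text string, where a "word" can be defined as any sequence of characters. The function should return the most common word.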

2017-10-12 Shivam

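Here is a function I designed. Note that I split the string on spaces, trim any leading or trailing white space, remove ".", and convert all upper-case letters to lower case. Finally, if there is a tie, I always report the first word. These are assumptions you should reconsider for your own analysis.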

# Create example string 
string <- "This is a very short sentence. It has only a few words." 

library(stringr) 

most_common_word <- function(string){ 
    string1 <- str_split(string, pattern = " ")[[1]] # Split the string 
    string2 <- str_trim(string1) # Remove white space 
    string3 <- str_replace_all(string2, fixed("."), "") # Remove dot 
    string4 <- tolower(string3) # Convert to lower case 
    word_count <- table(string4) # Count the word number 
    return(names(word_count[which.max(word_count)][1])) # Report the most common word 
} 

most_common_word(string) 
[1] "a"
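
The assumptions listed in this answer (only "." is removed, and the split is on single spaces) can be relaxed. As a minimal sketch in base R, assuming that every punctuation character should be dropped and that any run of white space separates words; the name `most_common_word_punct` is hypothetical and not part of the answer:

```r
# Sketch only: a base-R variant that strips all punctuation, not just ".".
# `most_common_word_punct` is a hypothetical name introduced for illustration.
most_common_word_punct <- function(string) {
  # Lower-case first, then split on runs of white space
  words <- strsplit(tolower(string), "[[:space:]]+")[[1]]
  # Drop every punctuation character (this also removes apostrophes)
  words <- gsub("[[:punct:]]", "", words)
  # Discard tokens that became empty
  words <- words[nzchar(words)]
  word_count <- table(words)
  # which.max() picks the first maximum, so ties go to the alphabetically first word
  names(word_count)[which.max(word_count)]
}

most_common_word_punct("A cat, a dog; a bird!")
```

Because apostrophes count as punctuation here, "don't" would be counted as "dont"; whether that is acceptable depends on the analysis.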


2017-10-12 14:13:39 www

Hope this helps:

most_common_word=function(x){ 

     #Split every word into single words for counting 
     splitTest=strsplit(x," ") 

     #Counting words 
     count=table(splitTest) 

     #Sorting to select only the highest value, which is the first one 
     count=count[order(count, decreasing=TRUE)][1] 

     #Return the desired character. 
     #By changing this you can choose whether it show the number of times a word repeats 
     return(names(count)) 
     }

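You can use return(count) instead to show the word together with the number of times it is repeated. Be careful: this function has a limitation when two words are repeated the same number of times.

table() lists the words in alphabetical order, and order() with decreasing=TRUE puts the highest count first while leaving tied counts in that alphabetical order. So if the words 'a' and 'b' are repeated the same number of times, most_common_word returns only 'a'.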
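
If ties matter, one option is to return every word that shares the highest count rather than only the first. A minimal sketch, assuming the same space-based split as this answer; `most_common_words` is a hypothetical helper name:

```r
# Sketch only: return all words tied for the highest count.
# `most_common_words` is a hypothetical name, not part of the answer.
most_common_words <- function(x) {
  words <- strsplit(x, " ")[[1]]   # split on single spaces, as in the answer
  count <- table(words)            # counts, with names in alphabetical order
  # Keep every name whose count equals the maximum
  names(count)[as.vector(count) == max(count)]
}

most_common_words("b a b a c")
```

In this example both 'a' and 'b' occur twice, so both are returned.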


2017-10-12 14:04:53 Cris

For general purposes, it is better to use boundary("word") from stringr:

library(stringr)
most_common_word <- function(s){
    words <- str_split(s, boundary("word"))[[1]] # Split on word boundaries
    names(which.max(table(words))) # Return the word itself, not its index
}
sentence <- "This is a very short sentence. It has only a few words: a, a. a"
most_common_word(sentence)
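
To see what boundary("word") changes, it can help to compare it with a plain split on spaces. This is a small illustration, assuming stringr is installed; the example string `s` is made up:

```r
library(stringr)

# Made-up example string with punctuation attached to some words
s <- "Only a few words: a, a. a"

# Splitting on a single space keeps punctuation attached to the tokens
by_space <- str_split(s, " ")[[1]]
by_space

# Splitting on word boundaries yields the words without the punctuation
by_word <- str_split(s, boundary("word"))[[1]]
by_word
```

With the space-based split, "a," and "a." are counted as different tokens from "a", which is why the counts can come out wrong.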


2017-10-12 14:13:35

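Excellent use of the 'boundary' modifier. – www

Thanks for the great answer, it works. Now, just one question: I have a text file that I read in as desc, and I run words <- rep('', length(desc)); system.time(for (i in 1:length(desc)) { words[i] <- most_common_word(desc[i]) }). When it computes, I get an error. I have the same problem when saving from this code. – Shivam

@S. 奧利維亞, could you take a look at edit 2? Any suggestions? – Shivam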
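
The thread does not show the error message, so the cause is unknown. One plausible cause, stated here only as an assumption, is a line with no words: the count table is then empty, the function returns a zero-length result, and assigning it to words[i] fails. A minimal sketch in base R that returns NA for such lines and uses vapply() instead of an explicit loop; `most_common_word_safe` and the sample `desc` vector are hypothetical:

```r
# Sketch only: a base-R stand-in for most_common_word() that tolerates empty lines.
# `most_common_word_safe` is a hypothetical name introduced for illustration.
most_common_word_safe <- function(s) {
  # Split on runs of non-alphanumeric characters (so "don't" becomes "don" and "t")
  words <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  # No words on this line: return NA instead of a zero-length result
  if (length(words) == 0) return(NA_character_)
  count <- table(words)
  names(count)[which.max(count)]
}

# `desc` stands in for the lines of the text file, e.g. as returned by readLines()
desc <- c("the cat and the dog", "", "a b a")

# vapply() checks that each call returns exactly one character value
words <- vapply(desc, most_common_word_safe, character(1), USE.NAMES = FALSE)
words
```

The empty second line yields NA rather than stopping the computation.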

Use the tidytext package to take advantage of its built-in parsing functions:

library(tidytext) 
library(dplyr) 
word_count <- function(test_sentence) { 
    unnest_tokens(data.frame(sentence = test_sentence, 
                             stringsAsFactors = FALSE), word, sentence) %>% 
        count(word, sort = TRUE) 
} 

word_count("This is a very short sentence. It has only a few words.")

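This gives you a table of all the word counts. You can adapt the function to return only the top word, but be aware that there will sometimes be a tie for first place, so it should be flexible enough to extract multiple winners.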
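
As a sketch of what "extracting multiple winners" could look like, assuming tidytext and dplyr are installed, the count table can be filtered to the rows whose count equals the maximum. The name `top_words` is hypothetical:

```r
library(tidytext)
library(dplyr)

# Sketch only: keep every word whose count equals the maximum count.
# `top_words` is a hypothetical name, not part of the answer.
top_words <- function(test_sentence) {
  data.frame(sentence = test_sentence, stringsAsFactors = FALSE) %>%
    unnest_tokens(word, sentence) %>%   # one lower-cased word per row
    count(word) %>%                     # adds the count column `n`
    filter(n == max(n))                 # all rows tied for first place
}

top_words("one two two three three")
```

Here both "two" and "three" occur twice, so the result has two rows.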


2017-10-12 14:49:19
