2017-02-14 54 views

I'm looking for a way to count the frequency of word pairs that occur in sequences of events delimited by a character. In other words: how can I use R to count the frequency of bigrams while preserving the order in which they appear in the text?

Input: 
"Start>Press1>Press2>PressQR>Exit" 
"Start>PressA>Press2>PressQR>QuitL>Exit" 
"Start>Press1>Press2>Press3>Exit" 

Output: 
Start>Press1    2 
Press1>Press2   2 
Press2>PressQR  2 
PressQR>Exit    1 
Start>PressA    1 
PressA>Press2   1 
PressQR>QuitL   1 
QuitL>Exit      1 
Press2>Press3   1 
Press3>Exit     1 

Thanks.

Answers

Build a directed edgelist, then aggregate:

x <- c("Start>Press1>Press2>PressQR>Exit", "Start>PressA>Press2>PressQR>QuitL>Exit", "Start>Press1>Press2>Press3>Exit") 
# split each sequence on ">" and pair every event with its successor 
edgelist <- do.call(rbind, lapply(strsplit(x, ">"), function(s) cbind(head(s, -1), s[-1]))) 
aggregate(count ~ ., data.frame(edgelist, count = 1), FUN = sum) 

#         X1      X2 count 
# 1   Press3    Exit     1 
# 2  PressQR    Exit     1 
# 3    QuitL    Exit     1 
# 4    Start  Press1     2 
# 5   Press1  Press2     2 
# 6   PressA  Press2     1 
# 7   Press2  Press3     1 
# 8    Start  PressA     1 
# 9   Press2 PressQR     2 
# 10 PressQR   QuitL     1 
input <- c("Start>Press1>Press2>PressQR>Exit","Start>PressA>Press2>PressQR>QuitL>Exit","Start>Press1>Press2>Press3>Exit") 

gen_pairs <- function(x) { 
    # split on ">" and paste each event together with its successor 
    x_split <- unlist(strsplit(x, ">")) 
    paste(x_split[-length(x_split)], x_split[-1], sep = ">") 
} 
all_pairs <- unlist(lapply(input, gen_pairs)) 
all_pairs_ctab <- table(all_pairs) 
# reorder the table so pairs appear in their original order of first appearance 
as.data.frame(all_pairs_ctab[match(unique(all_pairs), names(all_pairs_ctab))]) 

You can use the tidytext package via its unnest_tokens function, which supports n-gram tokenization:

library(dplyr) 
library(tidytext) 

data.frame(text = c("Start>Press1>Press2>PressQR>Exit", "Start>PressA>Press2>PressQR>QuitL>Exit", "Start>Press1>Press2>Press3>Exit")) %>%  
    unnest_tokens(bigram, text, 'ngrams', n = 2, to_lower = FALSE) %>% 
    count(bigram) 

#> # A tibble: 11 × 2 
#>    bigram              n 
#>    <chr>           <int> 
#>  1 Exit Start          2 
#>  2 Press1 Press2       2 
#>  3 Press2 Press3       1 
#>  4 Press2 PressQR      2 
#>  5 Press3 Exit         1 
#>  6 PressA Press2       1 
#>  7 PressQR Exit        1 
#>  8 PressQR QuitL       1 
#>  9 QuitL Exit          1 
#> 10 Start Press1        2 
#> 11 Start PressA        1 

(Note that the "Exit Start" rows appear to be an artifact of unnest_tokens collapsing all rows into a single document before tokenizing, so these bigrams span the boundary between two sequences rather than occurring within one.)

Or, if you prefer, you can do the same thing with the underlying tokenizers::tokenize_ngrams function and table:
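A minimal sketch of that approach (assuming the tokenizers package is installed; its word tokenizer treats ">" as a delimiter, so no extra splitting is needed):

```r
library(tokenizers)

input <- c("Start>Press1>Press2>PressQR>Exit",
           "Start>PressA>Press2>PressQR>QuitL>Exit",
           "Start>Press1>Press2>Press3>Exit")

# n = 2 yields bigrams; lowercase = FALSE preserves the event names.
# Each sequence is tokenized separately, so no pair spans two sequences.
bigrams <- unlist(tokenize_ngrams(input, n = 2, lowercase = FALSE))
table(bigrams)
```

The bigram tokens are joined with a space (the default ngram_delim); passing ngram_delim = ">" would match the output format in the question.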
