2017-02-18

N-grams in R: computing word frequencies and the sum of values

I want to perform the following calculation.

Input:

Column_A                Column_B
Word_A                  10
Word_A Word_B           20
Word_B Word_A           30
Word_A Word_B Word_C    40

Output:

Column_A1               Column_B1
Word_A                  100 = 10+20+30+40
Word_B                  90 = 20+30+40
Word_C                  40 = 40
Word_A Word_B           90 = 20+30+40
Word_A Word_C           40 = 40
Word_B Word_C           40 = 40
Word_A Word_B Word_C    40 = 40

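The order of the words in the output does not matter, so Word_A Word_B = 90 = Word_B Word_A. Using the RWeka and tm libraries I can extract unigrams (single words only), but I need n-grams with n = 1, 2, 3, and I need to compute Column_B1 for each of them.

Answer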

A tidyverse approach:

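library(tidyverse)
library(tokenizers)

# Sample data reconstructed from the question
df <- tibble(Column_A = c("Word_A", "Word_A Word_B", "Word_B Word_A", "Word_A Word_B Word_C"),
             Column_B = c(10L, 20L, 30L, 40L))

df %>%
    rowwise() %>%
    # contiguous 1- to 3-grams plus skip bigrams (e.g. "Word_A Word_C")
    mutate(ngram = list(c(tokenize_ngrams(Column_A, lowercase = FALSE, n = 3, n_min = 1),
                          tokenize_skip_ngrams(Column_A, lowercase = FALSE, n = 2),
                          recursive = TRUE)),
           # sort the words inside each n-gram so word order is ignored, then deduplicate
           ngram = list(unique(map_chr(strsplit(ngram, ' '),
                                       ~paste(sort(.x), collapse = ' '))))) %>%
    ungroup() %>%
    unnest(ngram) %>%               # one row per n-gram
    count(ngram, wt = Column_B)     # sum Column_B for each n-gram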

## # A tibble: 7 × 2
##   ngram                    n
##   <chr>                <int>
## 1 Word_A                 100
## 2 Word_A Word_B           90
## 3 Word_A Word_B Word_C    40
## 4 Word_A Word_C           40
## 5 Word_B                  90
## 6 Word_B Word_C           40
## 7 Word_C                  40
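
The reason "Word_B Word_A" and "Word_A Word_B" end up in the same row is the `sort()`/`paste()` step in the pipeline above. A minimal base R sketch of that step in isolation follows; `normalize_ngram` is a hypothetical helper name used only for illustration, not something defined by the answer or by any package.

```r
# Hypothetical helper (illustration only): put the words of each n-gram
# into alphabetical order so that word order no longer matters.
normalize_ngram <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE),
         function(words) paste(sort(words), collapse = " "),
         character(1))
}

# Both spellings collapse to the same key, "Word_A Word_B".
normalize_ngram(c("Word_B Word_A", "Word_A Word_B"))
```

Once every n-gram is reduced to such a key, summing Column_B per key gives the order-independent totals.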

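Note that this currently only works for strings of up to three words. For longer strings, you will have to work out how many n-grams you want to skip, or take a different approach.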
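
One possible "different approach", sketched here in base R under an assumption the answer does not state: every combination of up to three distinct words that occur in the same string should count, whether or not the words are adjacent. That matches the expected output in the question (which includes "Word_A Word_C"), but for long strings it produces many more combinations than contiguous n-grams would. The names `word_combinations`, `long` and `result` are illustrative only.

```r
# Sample data reconstructed from the question
df <- data.frame(
  Column_A = c("Word_A", "Word_A Word_B", "Word_B Word_A", "Word_A Word_B Word_C"),
  Column_B = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all order-independent combinations of 1..max_n
# distinct words found in a single string.
word_combinations <- function(text, max_n = 3) {
  words <- sort(unique(strsplit(trimws(text), " +")[[1]]))
  sizes <- seq_len(min(max_n, length(words)))
  unlist(lapply(sizes, function(k) {
    # combn() enumerates the k-word subsets; paste() joins each into one key
    combn(words, k, FUN = paste, collapse = " ")
  }))
}

combos <- lapply(df$Column_A, word_combinations)

# One row per (combination, value) pair, then sum the values per combination
long <- data.frame(
  ngram = unlist(combos),
  value = rep(df$Column_B, lengths(combos)),
  stringsAsFactors = FALSE
)
result <- aggregate(value ~ ngram, data = long, FUN = sum)
result
```

Because the words are sorted and deduplicated before `combn()` is called, this works for strings of any length without tuning a skip parameter. If only contiguous n-grams are wanted, this is not the right tool.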