2017-05-08

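Counting words in "lines" tokens

I am brand new to R, so this question may seem obvious. However, I have not managed to do it, nor have I found a solution.

How can I count the number of words inside my tokens when the tokens are lines (actually reviews)? I have a dataset of reviews (reviewText), each linked to a product ID (asin).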

amazonr_tidy_sent = amazonr_tidy_sent %>% unnest_tokens(word, reviewText, token = "lines")
amazonr_tidy_sent = amazonr_tidy_sent %>% anti_join(stop_words) %>% ungroup()

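I tried it the following way

wordcounts <- amazonr_tidy_sent %>% group_by(word, asin) %>% summarize(word = n())

but that does not give the right result. I suspect there is no way to count the words, because the line tokens cannot be "isolated".

Thanks a lot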

What does 'amazonr_tidy_sent' look like? – Gopala

Two columns: "asin" (e.g. B000M341QE, B000J3OTO6, etc.) and "word". The "word" column holds the reviews, tokenized as lines with 'unnest_tokens' –

Can you post 'dput(head(amazonr_tidy_sent, 10))'? – Gopala

Answers

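You can use unnest_tokens() more than once, if that is appropriate for your analysis.

First, you can use unnest_tokens() to get the lines you want. Notice that I am adding a column to keep track of the id of each line; you can call it whatever you like, but the important thing is to have a column that records which line you are on.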

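library(tidytext)
library(dplyr)
library(janeaustenr)

# tibble() replaces the deprecated data_frame()
d <- tibble(txt = prideprejudice)

# One row per line of text, plus an id column recording the line number
d_lines <- d %>%
    unnest_tokens(line, txt, token = "lines") %>%
    mutate(id = row_number())

d_lines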

#> # A tibble: 10,721 × 2 
#>                  line 
#>                  <chr> 
#> 1              pride and prejudice 
#> 2               by jane austen 
#> 3                chapter 1 
#> 4 it is a truth universally acknowledged, that a single man in possession 
#> 5       of a good fortune, must be in want of a wife. 
#> 6 however little known the feelings or views of such a man may be on his 
#> 7 first entering a neighbourhood, this truth is so well fixed in the minds 
#> 8 of the surrounding families, that he is considered the rightful property 
#> 9         of some one or other of their daughters. 
#> 10 "my dear mr. bennet," said his lady to him one day, "have you heard that 
#> # ... with 10,711 more rows, and 1 more variables: id <int> 

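Now you can use unnest_tokens() again, this time with token = "words", so that you get one row per word. Notice that you still know which line each word came from.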

d_words <- d_lines %>% 
    unnest_tokens(word, line, token = "words") 

d_words 
#> # A tibble: 122,204 × 2 
#>  id  word 
#> <int>  <chr> 
#> 1  1  pride 
#> 2  1  and 
#> 3  1 prejudice 
#> 4  2  by 
#> 5  2  jane 
#> 6  2 austen 
#> 7  3 chapter 
#> 8  3   1 
#> 9  4  it 
#> 10  4  is 
#> # ... with 122,194 more rows 

Now you can do whatever kind of counting you want. For example, maybe you want to know how many words are in each line?

d_words %>% 
    count(id) 

#> # A tibble: 10,715 × 2 
#>  id  n 
#> <int> <int> 
#> 1  1  3 
#> 2  2  3 
#> 3  3  2 
#> 4  4 12 
#> 5  5 11 
#> 6  6 15 
#> 7  7 13 
#> 8  8 11 
#> 9  9  8 
#> 10 10 15 
#> # ... with 10,705 more rows 
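
Applied to data shaped like that in the question, the same two-step approach might look like the sketch below. The toy `reviews` tibble and the names `line`, `line_id`, and `n_words` are made up for illustration; only `asin` and `reviewText` come from the question.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the Amazon review data in the question.
reviews <- tibble(
    asin = c("B000M341QE", "B000M341QE", "B000J3OTO6"),
    reviewText = c("Great product\nworks well", "Not bad", "Broke after one week")
)

# Step 1: one row per line, with an id so each line can be traced later.
review_lines <- reviews %>%
    unnest_tokens(line, reviewText, token = "lines") %>%
    mutate(line_id = row_number())

# Step 2: one row per word; asin and line_id are carried along.
review_words <- review_lines %>%
    unnest_tokens(word, line, token = "words")

# Word counts per line and per product.
words_per_line <- review_words %>% count(asin, line_id, name = "n_words")
words_per_asin <- review_words %>% count(asin, name = "n_words")
```

Counting by `asin` alone answers the per-product question, while keeping `line_id` in the count preserves the per-line detail.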

By splitting each line with strsplit(), we can count the number of words per line.

Some example data (containing newlines and stop words):

library(dplyr)
library(tidytext)
# tibble() replaces the deprecated data_frame()
d = tibble(reviewText = c('1 2 3 4 5 able', '1 2\n3 4 5\n6\n7\n8\n9 10 above', '1!2', '1',
          '!', '', '\n', '1', 'able able', 'above above', 'able', 'above'),
      asin = rep(letters, each = 2, length.out = length(reviewText)))

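Count the number of words:

# by_line is not defined in the post as shown; assumed here to be d tokenized by line
by_line = d %>% unnest_tokens(word, reviewText, token = 'lines')

by_line %>%
    group_by(asin) %>%
    summarize(word = sum(sapply(strsplit(word, '\\s'), length)))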

    asin word 
    <chr> <int> 
1  a 17 
2  b  2 
3  c  1 
4  d  1 
5  e  4 

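Note: your original code will not remove most stop words, because you split the data by line. Only lines that consist of exactly one stop word will be removed.

To exclude stop words from the word count, use this: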
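
A small base R analogy may make the difference concrete. The `stops` vector below is a made-up stand-in for `stop_words$word`, and matching whole lines with `%in%` plays the role that `anti_join()` plays on line tokens:

```r
# Hypothetical stop-word list and line tokens.
stops <- c("able", "above", "the")
line_tokens <- c("the able cat", "able", "above average")

# Whole-line matching: only the line that is exactly one stop word is dropped.
kept_lines <- line_tokens[!line_tokens %in% stops]

# Word-level matching: split each line first, then filter its words.
kept_words <- lapply(strsplit(line_tokens, "\\s"), function(w) w[!w %in% stops])
n_kept <- lengths(kept_words)
```

Here `kept_lines` still contains "the able cat" with both of its stop words, whereas the word-level filter leaves only "cat" from that line.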

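by_line %>%
    group_by(asin) %>%
    # Split each line on whitespace, drop stop words, then count what is left.
    # %in% is used instead of setdiff(), which would also collapse repeated words.
    summarize(word = word %>% strsplit('\\s') %>%
        lapply(function(w) w[!w %in% stop_words$word]) %>% sapply(length) %>% sum)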

    asin word 
    <chr> <int> 
1  a 15 
2  b  2 
3  c  1 
4  d  1 
5  e  0 
6  f  0