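Counting words within "lines" tokens

I'm completely new to R, so this question may seem obvious. Still, I haven't managed to do it and couldn't find a solution.

How can I count the number of words inside my tokens when the tokens are lines (actually reviews)? I have a dataset of reviews (reviewText) linked to product IDs (asin).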

amazonr_tidy_sent = amazonr_tidy_sent %>%
    unnest_tokens(word, reviewText, token = "lines")
amazonr_tidy_sent = amazonr_tidy_sent %>%
    anti_join(stop_words) %>%
    ungroup()

I tried it the following way:

wordcounts <- amazonr_tidy_sent %>% group_by(word, asin) %>% summarize(word = n())

but that isn't right. I suspect there is no way to count this, because line tokens can't be "isolated".

Thanks a lot.

2017-05-08 Роман Бронников

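What does 'amazonr_tidy_sent' look like? – Gopala

Two columns: 'asin' (e.g. B000M341QE, B000J3OTO6, etc.) and 'word'. The 'word' column holds the reviews, tokenized into lines with 'unnest_tokens'. –

Could you post 'dput(head(amazonr_tidy_sent, 10))'? – Gopala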

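You can use unnest_tokens() more than once, if that suits your analysis.

First, you can use unnest_tokens() to get the lines you want. Note that I am adding a column to keep track of the ID of each line; you can call it whatever you like, but the important thing is to have a column that records which line you are on.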

library(tidytext) 
library(dplyr) 
library(janeaustenr) 


# tibble() replaces data_frame(), which is deprecated in current tibble/dplyr
d <- tibble(txt = prideprejudice)

d_lines <- d %>% 
    unnest_tokens(line, txt, token = "lines") %>% 
    mutate(id = row_number()) 

d_lines 

#> # A tibble: 10,721 × 2
#>    line
#>    <chr>
#> 1  pride and prejudice
#> 2  by jane austen
#> 3  chapter 1
#> 4  it is a truth universally acknowledged, that a single man in possession
#> 5  of a good fortune, must be in want of a wife.
#> 6  however little known the feelings or views of such a man may be on his
#> 7  first entering a neighbourhood, this truth is so well fixed in the minds
#> 8  of the surrounding families, that he is considered the rightful property
#> 9  of some one or other of their daughters.
#> 10 "my dear mr. bennet," said his lady to him one day, "have you heard that
#> # ... with 10,711 more rows, and 1 more variables: id <int>

Now you can use unnest_tokens() again, this time with words, so that you get one row per word. Note that you still know which line each word came from.

d_words <- d_lines %>% 
    unnest_tokens(word, line, token = "words") 

d_words 
#> # A tibble: 122,204 × 2 
#>       id      word
#>    <int>     <chr>
#> 1      1     pride
#> 2      1       and
#> 3      1 prejudice
#> 4      2        by
#> 5      2      jane
#> 6      2    austen
#> 7      3   chapter
#> 8      3         1
#> 9      4        it
#> 10     4        is
#> # ... with 122,194 more rows

Now you can do whatever kind of counting you want; for example, maybe you want to know how many words each line has?

d_words %>% 
    count(id) 

#> # A tibble: 10,715 × 2 
#>       id     n
#>    <int> <int>
#> 1      1     3
#> 2      2     3
#> 3      3     2
#> 4      4    12
#> 5      5    11
#> 6      6    15
#> 7      7    13
#> 8      8    11
#> 9      9     8
#> 10    10    15
#> # ... with 10,705 more rows
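
The same two-step approach carries over to review data keyed by product ID. The following is only a minimal sketch: the reviews tibble and its contents are made up for illustration, and line_id is a hypothetical column name. The asin and reviewText columns are the ones named in the question.

```r
library(dplyr)
library(tidytext)

# Made-up stand-in for the review data: one row per review.
reviews <- tibble(
  asin = c("B000M341QE", "B000M341QE", "B000J3OTO6"),
  reviewText = c("Works great.\nWould buy again!",
                 "Too loud",
                 "Broke after a week.\nNot recommended")
)

# Step 1: one row per line, with an ID so each line stays identifiable.
review_lines <- reviews %>%
  unnest_tokens(line, reviewText, token = "lines") %>%
  mutate(line_id = row_number())

# Step 2: one row per word; asin and line_id are carried along.
review_words <- review_lines %>%
  unnest_tokens(word, line, token = "words")

# Words per line and words per product.
words_per_line <- review_words %>% count(asin, line_id)
words_per_asin <- review_words %>% count(asin)
```

Because asin and line_id survive both tokenization steps, the counts can be taken at either level from the same word-level table.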

2017-05-09 19:52:42

By splitting each line with strsplit, we can count the number of words in each line.

Some example data (containing newlines and stop words):

library(dplyr) 
library(tidytext) 
# tibble() replaces the deprecated data_frame()
d = tibble(reviewText = c('1 2 3 4 5 able', '1 2\n3 4 5\n6\n7\n8\n9 10 above', '1!2', '1',
                          '!', '', '\n', '1', 'able able', 'above above', 'able', 'above'),
           asin = rep(letters, each = 2, length.out = length(reviewText)))

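Count the number of words (by_line is not defined in this answer; presumably it is d tokenized into lines with unnest_tokens):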

by_line %>% 
    group_by(asin) %>% 
    summarize(word = sum(sapply(strsplit(word, '\\s'), length))) 

   asin  word
  <chr> <int>
1     a    17
2     b     2
3     c     1
4     d     1
5     e     4

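Note: in your original code, most stop words will not be removed, because you split the data by line. Only lines that consist entirely of a single stop word will be removed.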
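
To see why, here is a small sketch with made-up line tokens (lines_tbl and kept are hypothetical names). anti_join() compares each whole line against the stop word list, so only an exact match is dropped:

```r
library(dplyr)
library(tidytext)

# Three "line" tokens; two of them are nothing but a single stop word.
lines_tbl <- tibble(word = c("able", "able to help", "above"))

# Rows are removed only when the entire line equals a stop word.
kept <- lines_tbl %>% anti_join(stop_words, by = "word")

kept$word
```

The line "able to help" survives even though it contains stop words, because the line as a whole is not an entry in stop_words.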

To exclude stop words from the word count, use this:

by_line %>% 
    group_by(asin) %>% 
    summarize(word = word %>% strsplit('\\s') %>% 
        lapply(setdiff, y = stop_words$word) %>% sapply(length) %>% sum) 

   asin  word
  <chr> <int>
1     a    15
2     b     2
3     c     1
4     d     1
5     e     0
6     f     0
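
One caveat: setdiff() returns unique values, so a non-stop word repeated within a line is counted only once. The sketch below is a variant under two assumptions: that by_line is simply the data tokenized into lines (the answer does not show how it was built), and that repeated words should each count. The sample data and the helper name count_non_stop are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Small made-up sample; the second review repeats the word "1".
d <- tibble(
  reviewText = c("1 2 3 4 5 able", "1 1 2\n3 above", "able able"),
  asin = c("a", "a", "b")
)

# Assumption: by_line is d split into one row per line.
by_line <- d %>% unnest_tokens(word, reviewText, token = "lines")

# Hypothetical helper: split lines on whitespace and count every word
# that is not a stop word, keeping repeats.
count_non_stop <- function(lines) {
  words <- unlist(strsplit(lines, "\\s+"))
  sum(!words %in% stop_words$word)
}

counts <- by_line %>%
  group_by(asin) %>%
  summarize(n_words = count_non_stop(word))
```

Filtering with %in% keeps one entry per occurrence, whereas the setdiff() version would count "1 1 2" as two words.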

2017-05-09 21:24:51 Johan
