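You can use unnest_tokens() more than once, if that is appropriate for your analysis.
First, use unnest_tokens() to get the lines you want. Notice that I am adding a column to keep track of the id of each line; you can call it whatever you like, but the important thing is to have a column that records which line you are on.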
library(tidytext)
library(dplyr)
library(janeaustenr)
# tibble() replaces the deprecated data_frame()
d <- tibble(txt = prideprejudice)
d_lines <- d %>%
  unnest_tokens(line, txt, token = "lines") %>%
  mutate(id = row_number())
d_lines
#> # A tibble: 10,721 × 2
#> line
#> <chr>
#> 1 pride and prejudice
#> 2 by jane austen
#> 3 chapter 1
#> 4 it is a truth universally acknowledged, that a single man in possession
#> 5 of a good fortune, must be in want of a wife.
#> 6 however little known the feelings or views of such a man may be on his
#> 7 first entering a neighbourhood, this truth is so well fixed in the minds
#> 8 of the surrounding families, that he is considered the rightful property
#> 9 of some one or other of their daughters.
#> 10 "my dear mr. bennet," said his lady to him one day, "have you heard that
#> # ... with 10,711 more rows, and 1 more variables: id <int>
Now you can use unnest_tokens() again, this time with token = "words", so that you get one row per word. Notice that you still know which line each word came from.
d_words <- d_lines %>%
  unnest_tokens(word, line, token = "words")
d_words
#> # A tibble: 122,204 × 2
#> id word
#> <int> <chr>
#> 1 1 pride
#> 2 1 and
#> 3 1 prejudice
#> 4 2 by
#> 5 2 jane
#> 6 2 austen
#> 7 3 chapter
#> 8 3 1
#> 9 4 it
#> 10 4 is
#> # ... with 122,194 more rows
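Because the id column survives the second call, the word-level table can be linked back to the line-level one. The following is a minimal sketch of that idea on a toy input; the names toy, toy_lines, toy_words and man_lines are made up for illustration and are not part of the original answer.

```r
library(tidytext)
library(dplyr)

# Toy input, an assumption for illustration (not the Austen text)
toy <- tibble(txt = c("A single man", "in want of a wife", "a man may be"))

# Step 1: one row per line, plus an id to remember the line
toy_lines <- toy %>%
  unnest_tokens(line, txt, token = "lines") %>%
  mutate(id = row_number())

# Step 2: one row per word; id is carried along
toy_words <- toy_lines %>%
  unnest_tokens(word, line, token = "words")

# Keep only the lines whose id appears next to the word "man"
man_lines <- toy_lines %>%
  semi_join(filter(toy_words, word == "man"), by = "id")
```

Here semi_join() is used only to filter toy_lines by the matching ids; any other join on id would work the same way.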
Now you can do whatever you want; for example, maybe you want to count how many words are on each line?
d_words %>%
  count(id)
#> # A tibble: 10,715 × 2
#> id n
#> <int> <int>
#> 1 1 3
#> 2 2 3
#> 3 3 2
#> 4 4 12
#> 5 5 11
#> 6 6 15
#> 7 7 13
#> 8 8 11
#> 9 9 8
#> 10 10 15
#> # ... with 10,705 more rows
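The count has 10,715 rows while d_lines has 10,721, so a few ids never appear in d_words; presumably those lines produced no word tokens and were dropped. If you want such lines to show up with a count of zero, one option is to join the counts back onto the full set of ids. This is a sketch on a toy input (toy, toy_lines, toy_counts and toy_all are illustrative names, and the "* * *" line is an assumed example of a line without words):

```r
library(tidytext)
library(dplyr)

# Toy input, an assumption for illustration (not the Austen text)
toy <- tibble(txt = c("It is a truth", "* * *", "universally acknowledged"))

toy_lines <- toy %>%
  unnest_tokens(line, txt, token = "lines") %>%
  mutate(id = row_number())

# Lines that yield no word tokens are absent from this count
toy_counts <- toy_lines %>%
  unnest_tokens(word, line, token = "words") %>%
  count(id)

# Join back to every id and turn the missing counts into 0
toy_all <- toy_lines %>%
  select(id) %>%
  left_join(toy_counts, by = "id") %>%
  mutate(n = coalesce(n, 0L))
```

The left_join() keeps every id from toy_lines, and coalesce() replaces the NA left behind for lines without words.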
What does 'amazonr_tidy_sent' look like? – Gopala
Two columns: "asin" (e.g., B000M341QE, B000J3OTO6, etc.) and "word". The "word" column holds the reviews, tokenized into lines using 'unnest_tokens' –
Can you post 'dput(head(amazonr_tidy_sent,10))'? – Gopala