Simple section labeling for plain-text input

I am analyzing some text data with tidytext (and the tidyverse), following Tidy Text Mining with R.

My input text file, myfile.txt, looks like this:

# Section 1 Name 
Lorem ipsum dolor 
sit amet ... (et cetera) 
# Section 2 Name 
<multiple lines here again>

with 60 or so sections.

I would like to generate a column section_name holding the string "Category 1 Name" or "Category 2 Name" as the value for the corresponding lines. For example, I have

library(tidyverse) 
library(tidytext) 
library(stringr) 

fname <- "myfile.txt" 
all_text <- readLines(fname) 
all_lines <- tibble(text = all_text) 
tidiedtext <- all_lines %>% 
    mutate(linenumber = row_number(), 
     section_id = cumsum(str_detect(text, regex("^#", ignore_case = TRUE)))) %>% 
    filter(!str_detect(text, regex("^#"))) %>% 
    ungroup()

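This adds a column to tidiedtext holding the section number that each line belongs to.

Is it possible to add a line to the mutate() call to add such a column? Or is there another approach I should use?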

Source

2017-02-23 weinerjm

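Here is an approach that uses grepl for simplicity, together with if_else and tidyr::fill. There is nothing wrong with your original approach, though; it is very similar to what the tidytext book does. Also note that filtering after adding line numbers leaves gaps in the numbering, because the numbers assigned to the header lines are dropped. If that matters, add the line numbers after the filter.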

library(tidyverse) 

text <- '# Section 1 Name 
Lorem ipsum dolor 
sit amet ... (et cetera) 
# Section 2 Name 
<multiple lines here again>' 

# tibble() replaces the deprecated data_frame()
all_lines <- tibble(text = read_lines(text)) 

tidied <- all_lines %>% 
    mutate(line = row_number(), 
      section = if_else(grepl('^#', text), text, NA_character_)) %>% 
    fill(section) %>% 
    filter(!grepl('^#', text)) 

tidied 
#> # A tibble: 3 × 3
#>                          text  line          section
#>                         <chr> <int>            <chr>
#> 1           Lorem ipsum dolor     2 # Section 1 Name
#> 2    sit amet ... (et cetera)     3 # Section 1 Name
#> 3 <multiple lines here again>     5 # Section 2 Name
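
To illustrate the note about line numbers, here is a minimal sketch that numbers the lines after the header rows are filtered out, so the numbering has no gaps. The sample vector sample_lines and the result name renumbered are made up for this illustration; the rest reuses the calls from the answer.

```r
library(dplyr)
library(tidyr)

# Sample input copied from the question (sample_lines is an illustrative name)
sample_lines <- c("# Section 1 Name",
                  "Lorem ipsum dolor",
                  "sit amet ... (et cetera)",
                  "# Section 2 Name",
                  "<multiple lines here again>")

renumbered <- tibble(text = sample_lines) %>%
    # Copy the header text onto header rows, NA elsewhere
    mutate(section = if_else(grepl("^#", text), text, NA_character_)) %>%
    # Carry each header down over the lines that follow it
    fill(section) %>%
    # Drop the header rows themselves
    filter(!grepl("^#", text)) %>%
    # Number only the remaining lines
    mutate(line = row_number())

renumbered$line
#> [1] 1 2 3
```

With this ordering the line column counts body lines only, so it no longer matches positions in the original file.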

Or, if you just want to format the numbers you already have, add section_name = paste('Category', section_id, 'Name') to your mutate call.
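
A minimal sketch of that suggestion, assuming the section_id column is built with cumsum as in the question. The names question_lines and labelled are invented for this example.

```r
library(dplyr)
library(stringr)

# Sample input copied from the question (question_lines is an illustrative name)
question_lines <- c("# Section 1 Name",
                    "Lorem ipsum dolor",
                    "sit amet ... (et cetera)",
                    "# Section 2 Name",
                    "<multiple lines here again>")

labelled <- tibble(text = question_lines) %>%
    mutate(linenumber = row_number(),
           # Running count of header lines seen so far
           section_id = cumsum(str_detect(text, "^#")),
           # Columns created earlier in the same mutate() can be reused here
           section_name = paste("Category", section_id, "Name")) %>%
    filter(!str_detect(text, "^#"))

labelled$section_name
#> [1] "Category 1 Name" "Category 1 Name" "Category 2 Name"
```

This builds the label from the section number alone, so it does not depend on what the header lines actually say.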

Source

2017-02-23 21:34:41 alistaire

Thanks! This is pretty much what I was looking for. – weinerjm

I don't expect you to rewrite your whole script, but I found the question interesting and wanted to add a base R attempt:

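parse_data <- function(file_name) { 
    # Assumes the first line of the file is a section header
    all_rows <- readLines(file_name) 
    # Header lines start with '#'; anchoring avoids matching a '#' inside body text
    indices <- which(grepl('^#', all_rows)) 
    # Repeat each header index once per line of its section (header included)
    splitter <- rep(indices, diff(c(indices, length(all_rows) + 1))) 
    lst <- split(all_rows, splitter) 
    # In each chunk, the first element is the header and the rest are body lines
    lst <- lapply(lst, function(x) { 
        data.frame(text = x[-1], section = rep(x[1], length(x) - 1), 
                   stringsAsFactors = FALSE) 
    }) 
    # Original line numbers of the body lines
    line_nums <- seq_along(all_rows)[-indices] 
    df <- do.call(rbind.data.frame, lst) 
    cbind.data.frame(df, linenumber = line_nums) 
}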

Testing it on a file named ipsum_data.txt:

parse_data('ipsum_data.txt')

Output:

text                         section           linenumber
Lorem ipsum dolor            # Section 1 Name  2
sit amet ... (et cetera)     # Section 1 Name  3
<multiple lines here again>  # Section 2 Name  5

The file ipsum_data.txt contains:

# Section 1 Name 
Lorem ipsum dolor 
sit amet ... (et cetera) 
# Section 2 Name 
<multiple lines here again>
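
As an aside, the same table can probably be built in base R without split, by reusing the cumsum idea from the question. This is only a sketch under the assumption that the first line is a section header. It works on an in-memory vector rather than a file, and the names file_rows, is_header, section_idx, and result are illustrative.

```r
# Sample input matching ipsum_data.txt, held in memory (file_rows is an illustrative name)
file_rows <- c("# Section 1 Name",
               "Lorem ipsum dolor",
               "sit amet ... (et cetera)",
               "# Section 2 Name",
               "<multiple lines here again>")

# TRUE for header lines, FALSE for body lines
is_header <- grepl("^#", file_rows)
# Running count of headers: each line gets the index of its most recent header
section_idx <- cumsum(is_header)
headers <- file_rows[is_header]

result <- data.frame(text = file_rows[!is_header],
                     # Look up the header text for each body line
                     section = headers[section_idx[!is_header]],
                     # Positions of the body lines in the original input
                     linenumber = which(!is_header),
                     stringsAsFactors = FALSE)

result$linenumber
#> [1] 2 3 5
```

If the input had body lines before the first header, their index would be 0 and the lookup would come up short, which is why the sketch assumes a leading header.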

I hope this proves useful.

Source

2017-02-23 22:17:51 Abdou

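Thanks for the reply. This is very helpful. Rewriting the script is no big deal for me, but I think the other solution is closer to what I was looking for in terms of conciseness. – weinerjm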
