2016-08-13

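Extracting numbers from a string in R based on specific criteria

I want to extract some numbers from a string (comments) based on specific criteria. The numbers I want directly follow a time recorded in 24-hour format, always contain a decimal place, and are less than 20 (there are other numbers in the string, but I am not interested in those). I have managed to extract the numbers I want with the R code below, but I have no way of associating them with the ID they came from. Some IDs have multiple results of interest and others have only one. I need some way to associate the ID numbers in the dummy data below with each number of interest. As you can see, ID 1 contains three results of interest (4.1, 6.9 and 4.3), while ID 2 has only one (6.5).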

Any help would be great!

(An example of the format of comment.txt) 

    ID comments 
    1 abc1200 4.1 abc1100 6.9 etd1130 4.3 69.0 
    2 abc0900 6.5 abcde 15 
    3 3.2 0850 9.5 abc 8.2 0930 12.2 agft 75.0 
    4 ashdfalsk 0950 10.5 dvvxcvszv asdasd assdas d 75.0 


#rm(list=ls(all=TRUE)) 

# import the text and pull out a list of all numbers contained within the free text 
raw_text <- read.delim("comment.txt") 
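# match whole numbers (the times) as well as decimals; the dot is escaped so it only matches a literal period 
numbers_from_text <- gregexpr("[0-9]+(\\.[0-9]+)?", raw_text$comments) 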

numbers_list <- unlist(regmatches(raw_text$comments, numbers_from_text)) 
numbers_list <- as.data.frame(numbers_list) 

# flag the numbers that contain a decimal place and add a running row count 
format<-cbind(numbers_list,dem=(grepl("\\.",as.character(numbers_list$numbers_list)))*1,row.number=1:nrow(numbers_list)) 

# if the number does not contain a decimal (a time), set new_row to the following row (row.number + 1) 
# else return NA 
test <- cbind(format,new_row = ifelse(format$dem==0, format$row.number+1, "NA")) 

#match the cases where the new_row is equal to the row.number and then output the corresponding numbers_list 
match <-test$numbers_list[match(test$new_row,test$row.number)] 

# drop the NAs where there wasn't a match and keep only values less than 20 to ensure results are correct 
match_NA <- subset(match, match!= "<NA>" & as.numeric(as.character(match))<20) 

match_NA <- as.data.frame(match_NA) 

Answer


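Something like this seems to work: match the numbers that start with a blank and contain a period, then convert them to numeric and keep those that are less than 20.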

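library(stringr) 
library(purrr) 

# `comments` is assumed to be a one-column data frame holding one raw line of comment.txt per row 
temp <- apply(comments, 1, function(x) { 
    # a blank, one or more digits, a literal period, then a single digit 
    str_extract_all(x, "[[:blank:]][0-9]+[.][0-9]") 
}) 

# drop the extra list level, trim the leading blank and convert to numeric 
temp <- lapply(flatten(temp), function(x) as.numeric(str_trim(x))) 
# keep only the values below 20 
lapply(temp, function(x) x[x < 20]) 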

[[1]] 
[1] 4.1 6.9 4.3 

[[2]] 
[1] 6.5 

[[3]] 
[1] 3.2 9.5 8.2 12.2 

[[4]] 
[1] 10.5 
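
The list above is ordered by row, but the question asks for each value to be tied to its ID. One possible way to do that in base R is sketched below. It assumes the data sit in a data frame called `raw_text` with columns `ID` and `comments` (names taken from the question; the inline data frame is only a stand-in for reading comment.txt), and it uses a pattern without the leading blank, which is a small departure from the code above.

```r
# Stand-in for read.delim("comment.txt"); column names are assumed from the question
raw_text <- data.frame(
  ID = 1:4,
  comments = c(
    "abc1200 4.1 abc1100 6.9 etd1130 4.3 69.0",
    "abc0900 6.5 abcde 15",
    "3.2 0850 9.5 abc 8.2 0930 12.2 agft 75.0",
    "ashdfalsk 0950 10.5 dvvxcvszv asdasd assdas d 75.0"
  ),
  stringsAsFactors = FALSE
)

# Digits, a literal period, then more digits: this skips times such as 1200 and integers such as 15
pattern <- "[0-9]+[.][0-9]+"
hits <- regmatches(raw_text$comments, gregexpr(pattern, raw_text$comments))

# Repeat each ID once per match so that IDs and values stay aligned
result <- data.frame(
  ID = rep(raw_text$ID, lengths(hits)),
  value = as.numeric(unlist(hits))
)

# Keep only the values of interest (less than 20) and renumber the rows
result <- result[result$value < 20, ]
rownames(result) <- NULL
result
```

`result` is a long-format data frame with one row per value of interest, so an ID with several hits simply appears several times. If a list per ID is preferred, `split(result$value, result$ID)` should give the same grouping as the output above.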