2016-01-23
2

This is a follow-up to a question I asked yesterday: Partial string match two columns R

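The answer I got there was great; however, I have since found that many species are never mentioned directly. For example, tortoise is never named in the product data, but "exotic" is an acceptable match.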

dats<-data.frame(ID=c(1:4),species=c("dog","cat","rabbit","tortoise"), 
      species.descriptor=c("all animal dog","all animal cat","rabbit exotic","tortoise exotic"), 
      product=c(1,2,3,4),product.authorise=c("all animal dog cat rabbit","cat horse pig", 
      "dog cat","exotic")) 
dats 
  ID  species species.descriptor product         product.authorise
1  1      dog     all animal dog       1 all animal dog cat rabbit
2  2      cat     all animal cat       2             cat horse pig
3  3   rabbit      rabbit exotic       3                   dog cat
4  4 tortoise    tortoise exotic       4                    exotic

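I came up with a solution that works by pasting $species.descriptor and $product.authorise together, and then marking a row as "TRUE" if any given regular expression appears two or more times in the combined field, like this: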

library(stringr) 
dats$bound<-paste(dats$product.authorise, dats$species.descriptor) 

species_descriptor<-c("all animal","dog","cat","rabbit","exotic","horse","pig","tortoise") 
species_descriptor<-setNames(nm=species_descriptor) 
result<-ifelse(sapply(species_descriptor, str_count, string=dats$bound)>=2,"TRUE","FALSE") 
result<-as.data.frame(result) 

result$AuthorisedCount<-apply(result[,1:ncol(result)],MARGIN=1,function(x){sum(x=="TRUE",na.rm=T)}) 
result$SpeciesAuthorised<-ifelse(result$AuthorisedCount>=1,"TRUE","FALSE") 

dats<-cbind(dats, result$SpeciesAuthorised) 
names(dats)[7]<-"SpeciesAuthorised" 
dats$bound<-NULL 

dats 
  ID  species species.descriptor product         product.authorise SpeciesAuthorised
1  1      dog     all animal dog       1 all animal dog cat rabbit              TRUE
2  2      cat     all animal cat       2             cat horse pig              TRUE
3  3   rabbit      rabbit exotic       3                   dog cat             FALSE
4  4 tortoise    tortoise exotic       4                    exotic              TRUE

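This works well and runs quickly on a much larger dataset; however, I suspect there is a more elegant way of doing it. Does anyone have any suggestions?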

Answers

2

Using your sapply call and the bound variable, this produces the same result:

bound<-paste(dats$product.authorise, dats$species.descriptor) 
dats$SpeciesAuthorised <- as.logical(rowSums(sapply(species_descriptor, str_count, string=bound)>=2)) 
#   ID  species species.descriptor product         product.authorise SpeciesAuthorised
# 1  1      dog     all animal dog       1 all animal dog cat rabbit              TRUE
# 2  2      cat     all animal cat       2             cat horse pig              TRUE
# 3  3   rabbit      rabbit exotic       3                   dog cat             FALSE
# 4  4 tortoise    tortoise exotic       4                    exotic              TRUE
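
One thing to be aware of with the `>= 2` count on the pasted string: a term that occurs twice within a single column would also pass, even if the other column never mentions it. A minimal sketch that tests each column separately is below. It uses base R `grepl` with `fixed = TRUE` rather than `str_count`, and `terms` and `in_both` are names made up for illustration. Treat it as a sketch that assumes literal substring matching is what you want, not as a drop-in replacement.

```r
# Sample data copied from the question; stringsAsFactors = FALSE keeps text as character
dats <- data.frame(
  ID = 1:4,
  species = c("dog", "cat", "rabbit", "tortoise"),
  species.descriptor = c("all animal dog", "all animal cat",
                         "rabbit exotic", "tortoise exotic"),
  product = 1:4,
  product.authorise = c("all animal dog cat rabbit", "cat horse pig",
                        "dog cat", "exotic"),
  stringsAsFactors = FALSE
)

# Same terms as species_descriptor in the question (hypothetical name: terms)
terms <- c("all animal", "dog", "cat", "rabbit", "exotic", "horse", "pig", "tortoise")

# One column per term: TRUE where the term occurs in BOTH fields of that row
in_both <- sapply(terms, function(term) {
  grepl(term, dats$species.descriptor, fixed = TRUE) &
    grepl(term, dats$product.authorise, fixed = TRUE)
})

# A row is authorised if at least one term is shared between the two fields
dats$SpeciesAuthorised <- unname(rowSums(in_both) > 0)
```

On the sample data this agrees with the count-based version. Note that `grepl` matches substrings, so a term like "cat" would also match inside a longer word; add word boundaries if that matters for your data.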
1

Extending the answer from the post you mentioned, would this work?

dats$SpeciesAuthorised <- with(dats, 
           str_detect(species.descriptor, species) & 
            (str_detect(product.authorise, species) | str_detect(species.descriptor,product.authorise)) 
) 

I just added an OR operator so that it also detects the product.authorise pattern within species.descriptor.

dats 
  ID  species species.descriptor product         product.authorise SpeciesAuthorised
1  1      dog     all animal dog       1 all animal dog cat rabbit              TRUE
2  2      cat     all animal cat       2             cat horse pig              TRUE
3  3   rabbit      rabbit exotic       3                   dog cat             FALSE
4  4 tortoise    tortoise exotic       4                    exotic              TRUE
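
To make the row-wise logic explicit: each test pairs the pattern in row i with the string in row i. The sketch below spells that out in base R so the three conditions can be inspected one at a time. `detect_rowwise` is a hypothetical helper, not a stringr function, and it assumes literal (non-regex) matching.

```r
# Sample data copied from the question; stringsAsFactors = FALSE keeps text as character
dats <- data.frame(
  ID = 1:4,
  species = c("dog", "cat", "rabbit", "tortoise"),
  species.descriptor = c("all animal dog", "all animal cat",
                         "rabbit exotic", "tortoise exotic"),
  product = 1:4,
  product.authorise = c("all animal dog cat rabbit", "cat horse pig",
                        "dog cat", "exotic"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: literal substring test of pattern[i] within string[i]
detect_rowwise <- function(string, pattern) {
  unname(mapply(function(s, p) grepl(p, s, fixed = TRUE), string, pattern))
}

# The three conditions combined in the answer above
species_in_descriptor <- detect_rowwise(dats$species.descriptor, dats$species)
species_in_product    <- detect_rowwise(dats$product.authorise, dats$species)
product_in_descriptor <- detect_rowwise(dats$species.descriptor, dats$product.authorise)

authorised <- species_in_descriptor & (species_in_product | product_in_descriptor)
```

The third condition only holds when the whole product.authorise value appears inside species.descriptor. That works for a single word such as "exotic", but a value like "exotic bird" would not match "tortoise exotic".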
1

You can use the function any to shorten the code:

bound <- paste(dats$product.authorise, dats$species.descriptor) 
result <- ifelse(sapply(species_descriptor, str_count, string=bound)>=2, TRUE, FALSE) 

dats$SpeciesAuthorised <- apply(result, 1, any) 

There is no need to set the results to the character strings "TRUE" and "FALSE"; use logical values instead.
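
As a small illustration of why logical values are enough: a comparison such as `counts >= 2` already returns a logical matrix, so the `ifelse(..., TRUE, FALSE)` wrapper should not change anything. The `counts` matrix below is made-up data standing in for the output of the `sapply(..., str_count, ...)` call.

```r
# Hypothetical count matrix: one row per record, one column per term
counts <- matrix(c(2, 0, 1,
                   0, 0, 2),
                 nrow = 3,
                 dimnames = list(NULL, c("cat", "exotic")))

# The comparison alone yields a logical matrix
flags <- counts >= 2

# The ifelse wrapper produces the same values
wrapped <- ifelse(counts >= 2, TRUE, FALSE)
same_values <- all(flags == wrapped)

# Two ways to collapse each row to a single logical
by_any  <- apply(flags, 1, any)
by_sums <- rowSums(flags) > 0
```

Here `by_any` and `by_sums` give the same answer; `rowSums` avoids the row-by-row `apply` call, which may matter on a large dataset.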

Also, if you want to make the code cleaner and more readable, you can define your own function:

isSpeciesAuthorised = function(data, species_descriptor) { 
    bound <- paste(data$product.authorise, data$species.descriptor) 
    result <- ifelse(sapply(species_descriptor, str_count, string=bound)>=2, TRUE, FALSE) 

    return(apply(result, 1, any)) 
} 

and then use it:

dats$SpeciesAuthorised <- isSpeciesAuthorised(data=dats, species_descriptor)