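Look up strings using a substring in R
2017-11-10
3

I have a long list of strings that share substrings. The list comes from event-stream data, so it has thousands of rows, but I will simplify it to this example, pets: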

+--------------------------------+
| Pets                           |
+--------------------------------+
| "one calico cat that's smart"  |
| "German Shepard dog"           |
| "A Chameleon that is a Lizard" |
| "a cute tabby cat"             |
| "the fish guppy"               |
| "Lizard Gecko"                 |
| "German Shepard dog"           |
| "Budgie Bird"                  |
| "Canary Bird in a coal mine"   |
| "a chihuahua dog"              |
+--------------------------------+
dput output: structure(list(Pets = structure(c(8L, 6L, 1L, 3L, 9L, 7L, 6L, 4L, 5L, 2L),.Label = c("A Chameleon that is a Lizard", "a chihuahua dog", "a cute tabby cat", "Budgie Bird", "Canary Bird in a coal mine", "German Shepard dog", "Lizard Gecko", "one calico cat that's smart", "the fish guppy"), class = "factor")), .Names = "Pets", row.names = c(NA, -10L), class = "data.frame") 

I want to add information based on the general type of pet (dog, cat, etc.). I have a key table that holds this information:

+----------+----------------+
| key      | classification |
+----------+----------------+
| "dog"    | "canine"       |
| "cat"    | "feline"       |
| "lizard" | "reptile"      |
| "bird"   | "avian"        |
| "fish"   | "fish"         |
+----------+----------------+
dput output: structure(list(key = structure(c(3L, 2L, 5L, 1L, 4L), .Label = c("bird", "cat", "dog", "fish", "lizard"), class = "factor"), classification = structure(c(2L, 3L, 5L, 1L, 4L), .Label = c("avian", "canine", "feline", "fish", "reptile"), class = "factor")), .Names = c("key", "classification"), row.names = c(NA, -5L), class = "data.frame") 

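How can I use the "long strings" in the Pets table to look up the relevant classification in the key table? The problem is that my lookup strings contain the substrings found in the key table, rather than matching them exactly.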

I started with grepl like this:

key[grepl(pets[1,1], key[ , 2]), ] 

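However, this does not work, because "calico cat" is not in the key list, although "cat" is. The result I am looking for would be "feline". (Note: I cannot simply switch things around, because in my own code this sits inside an apply function that loops over every row of the data. So instead of pets[1,1] it is pets[n,1]. Eventually I plan to cbind the result to the event-stream data for further analysis.)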

I am having trouble wrapping my head around how to do this. Any suggestions?

+0

It looks like the key is always the second word of each "long string". Is that a reasonable assumption? – useR

+0

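Unfortunately, no. The strings contain anywhere from a couple to several different words. I only know that the "key" word is in there somewhere. – JoeM05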

+1

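Then you should provide a long string that does not fit this assumption. Also, please provide your dataset by copying and pasting the output of 'dput(my_data)' into your question, instead of the way you have currently formatted it. – useR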

Answers

2

You can do this kind of thing quite easily with the fuzzyjoin package.

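Here you can use regex_left_join, which works like a normal left join (such as dplyr::left_join), except that the criterion for matching rows is a regular-expression match, similar to stringr::str_detect: a row of pets is joined to a row of key when the key pattern is found in the pets string.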

library(tibble)
library(fuzzyjoin)

pets <- tribble(
  ~pets,
  "one calico cat that's smart",
  "German Shepard dog",
  "A Chameleon that is a Lizard",
  "a cute tabby cat",
  "the fish guppy",
  "Lizard Gecko",
  "German Shepard dog",
  "Budgie Bird",
  "Canary Bird in a coal mine",
  "a chihuahua dog"
)

key <- tribble(
  ~key,     ~classification,
  "dog",    "canine",
  "cat",    "feline",
  "lizard", "reptile",
  "bird",   "avian",
  "fish",   "fish"
)

regex_left_join(pets, key, by = c("pets" = "key"), ignore_case = TRUE)

#> # A tibble: 10 x 3
#>    pets                         key    classification
#>    <chr>                        <chr>  <chr>
#>  1 one calico cat that's smart  cat    feline
#>  2 German Shepard dog           dog    canine
#>  3 A Chameleon that is a Lizard lizard reptile
#>  4 a cute tabby cat             cat    feline
#>  5 the fish guppy               fish   fish
#>  6 Lizard Gecko                 lizard reptile
#>  7 German Shepard dog           dog    canine
#>  8 Budgie Bird                  bird   avian
#>  9 Canary Bird in a coal mine   bird   avian
#> 10 a chihuahua dog              dog    canine
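
If installing fuzzyjoin is not an option, roughly the same idea can be sketched in base R. This is only an illustration, not how fuzzyjoin is implemented: it assumes each string should receive at most one classification and keeps the first key (in key-table order) that matches, whereas regex_left_join returns one row per matching key. The data below is a subset of the question's tables, and the name first_match is introduced here for illustration.

```r
# Illustrative data: a subset of the question's tables, as plain character columns.
pets <- data.frame(
  pets = c("one calico cat that's smart", "German Shepard dog",
           "A Chameleon that is a Lizard", "the fish guppy", "Budgie Bird"),
  stringsAsFactors = FALSE
)
key <- data.frame(
  key = c("dog", "cat", "lizard", "bird", "fish"),
  classification = c("canine", "feline", "reptile", "avian", "fish"),
  stringsAsFactors = FALSE
)

# For each pet string, find the row index of the first key it contains
# (case-insensitive). NA when no key matches, mirroring a left join.
first_match <- vapply(pets$pets, function(p) {
  found <- vapply(key$key, function(k) grepl(k, p, ignore.case = TRUE),
                  logical(1), USE.NAMES = FALSE)
  hit <- which(found)
  if (length(hit) == 0) NA_integer_ else hit[1]
}, integer(1), USE.NAMES = FALSE)

# Attach the matched key and its classification to the original rows.
result <- pets
result$key <- key$key[first_match]
result$classification <- key$classification[first_match]
print(result)
```

Because the keys are used as regular expressions here, a key containing characters such as "." or "(" would need escaping, or fixed = TRUE together with tolower() on both sides.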
+0

That works. Handy library, thanks. – JoeM05

1

You can extract the key from each pet string and then look those keys up in the table:

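# Build one alternation from all keys: "dog|cat|lizard|bird|fish"
Pattern = paste(KeyTable$key, collapse="|")
# Wrap it so that the whole string is replaced by the captured key
Pattern = paste0(".*(", Pattern, ").*")
# Extract the key from each pet string, lower-cased to match the table
Type = tolower(sub(Pattern, "\\1", Pets, ignore.case=TRUE))
# Look up the classification for each extracted key
KeyTable$classification[match(Type, KeyTable$key)]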
[1] "feline" "canine" "reptile" "feline" "feline" "canine" "fish" 
[8] "reptile" "canine" "avian" "avian" "canine" 
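
One caveat with the pattern above is that a key also matches inside a longer word (for example "dog" inside "hotdog"). A possible variation, sketched here under the assumption that keys should only match whole words and contain no regex metacharacters, adds word boundaries and wraps the steps in a function. The function name classify_pets and the sample strings are hypothetical, not part of the original answer.

```r
# Hypothetical helper: same approach as above, but keys must match whole words.
classify_pets <- function(pets, key_table) {
  alternation <- paste(key_table$key, collapse = "|")
  # \\b requires a word boundary on both sides of the captured key
  pattern <- paste0(".*\\b(", alternation, ")\\b.*")
  # Strings without any key are left unchanged by sub(), so match() gives NA
  type <- tolower(sub(pattern, "\\1", pets, ignore.case = TRUE, perl = TRUE))
  key_table$classification[match(type, key_table$key)]
}

KeyTable <- data.frame(
  key = c("dog", "cat", "lizard", "bird", "fish"),
  classification = c("canine", "feline", "reptile", "avian", "fish"),
  stringsAsFactors = FALSE
)

classes <- classify_pets(c("calico cat", "Gecko Lizard", "hotdog stand"), KeyTable)
print(classes)
```

Whether whole-word matching is desirable depends on the data: it avoids false hits such as "hotdog", but it would also miss a compound like "catfish".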

Data

KeyTable = read.table(text="key classification 
dog canine 
cat feline 
lizard reptile 
bird avian  
fish fish", 
header=TRUE, stringsAsFactors=FALSE) 

Pets = c("calico cat", 
"Shepard dog" , 
"Chameleon Lizard", 
"calico cat", 
"tabby cat", 
"chihuahua dog", 
"guppy fish", 
"Gecko Lizard", 
"Shepard dog", 
"Budgie Bird", 
"Canary Bird" , 
"chihuahua dog") 
1

Here is another approach, using hashmap:

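library(dplyr)   # %>%, mutate, select, bind_cols
library(tidyr)   # separate_rows
library(hashmap)

# Map each key to its classification
hash_table = hashmap(Lookup$key, Lookup$classification)

Pets %>%
    separate_rows(Pets, sep = " ") %>%               # one row per word
    mutate(class = hash_table[[tolower(Pets)]]) %>%  # NA for words that are not keys
    na.omit() %>%                                    # keep only the key words
    select(Key = Pets, class) %>%
    bind_cols(Pets, .)                               # assumes exactly one key word per string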

Result:

> hash_table 
## (character) => (character) 
##  [fish] => [fish]  
##  [bird] => [avian]  
## [lizard] => [reptile] 
##  [cat] => [feline] 
##  [dog] => [canine] 

                           Pets    Key   class
1   one calico cat that's smart    cat  feline
2            German Shepard dog    dog  canine
3  A Chameleon that is a Lizard Lizard reptile
4              a cute tabby cat    cat  feline
5                the fish guppy   fish    fish
6                  Lizard Gecko Lizard reptile
7            German Shepard dog    dog  canine
8                   Budgie Bird   Bird   avian
9    Canary Bird in a coal mine   Bird   avian
10              a chihuahua dog    dog  canine
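
The same word-by-word lookup can also be sketched without extra packages, since a named character vector in base R behaves like a small key-to-value map. This is an alternative illustration rather than the method of the answer above; the name class_by_key and the reduced sample data (including the non-matching "pet rock") are assumptions made for the example.

```r
# Illustrative data in the same shape as the answer's Pets and Lookup tables.
Pets <- data.frame(
  Pets = c("one calico cat that's smart", "Lizard Gecko",
           "a chihuahua dog", "pet rock"),
  stringsAsFactors = FALSE
)
Lookup <- data.frame(
  key = c("dog", "cat", "lizard", "bird", "fish"),
  classification = c("canine", "feline", "reptile", "avian", "fish"),
  stringsAsFactors = FALSE
)

# Named vector: names are the keys, values are the classifications.
class_by_key <- setNames(Lookup$classification, Lookup$key)

# Split each lower-cased string into words and keep the first word that is a key.
words <- strsplit(tolower(Pets$Pets), " ", fixed = TRUE)
Pets$class <- vapply(words, function(w) {
  hit <- class_by_key[w]      # NA for words that are not keys
  hit <- hit[!is.na(hit)]
  if (length(hit) == 0) NA_character_ else hit[[1]]
}, character(1))
print(Pets)
```

Unlike the na.omit() and bind_cols() pipeline, this keeps strings with no key word (as NA) and stays aligned with the original rows even when a string contains more than one key word.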

Data:

Pets = structure(list(Pets = c("one calico cat that's smart", "German Shepard dog", 
           "A Chameleon that is a Lizard", "a cute tabby cat", "the fish guppy", 
           "Lizard Gecko", "German Shepard dog", "Budgie Bird", "Canary Bird in a coal mine", 
           "a chihuahua dog")), .Names = "Pets", row.names = c(NA, -10L), class = "data.frame") 


Lookup = structure(list(key = c("dog", "cat", "lizard", "bird", "fish"), 
         classification = c("canine", "feline", "reptile", "avian", 
         "fish")), class = "data.frame", .Names = c("key", "classification" 
        ), row.names = c(NA, -5L))