Look up strings using substrings in R

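I have a long list of strings that share substrings. The list comes from event-stream data, so it has thousands of rows, but I will simplify it for this example using pets: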

+--------------------------------+
| Pets                           |
+--------------------------------+
| "one calico cat that's smart"  |
| "German Shepard dog"           |
| "A Chameleon that is a Lizard" |
| "a cute tabby cat"             |
| "the fish guppy"               |
| "Lizard Gecko"                 |
| "German Shepard dog"           |
| "Budgie Bird"                  |
| "Canary Bird in a coal mine"   |
| "a chihuahua dog"              |
+--------------------------------+
dput output: structure(list(Pets = structure(c(8L, 6L, 1L, 3L, 9L, 7L, 6L, 4L, 5L, 2L),.Label = c("A Chameleon that is a Lizard", "a chihuahua dog", "a cute tabby cat", "Budgie Bird", "Canary Bird in a coal mine", "German Shepard dog", "Lizard Gecko", "one calico cat that's smart", "the fish guppy"), class = "factor")), .Names = "Pets", row.names = c(NA, -10L), class = "data.frame")

I would like to add information based on the general type of pet (dog, cat, etc.). I have a key table that holds this information:

+----------+----------------+
| key      | classification |
+----------+----------------+
| "dog"    | "canine"       |
| "cat"    | "feline"       |
| "lizard" | "reptile"      |
| "bird"   | "avian"        |
| "fish"   | "fish"         |
+----------+----------------+
dput output: structure(list(key = structure(c(3L, 2L, 5L, 1L, 4L), .Label = c("bird", "cat", "dog", "fish", "lizard"), class = "factor"), classification = structure(c(2L, 3L, 5L, 1L, 4L), .Label = c("avian", "canine", "feline", "fish", "reptile"), class = "factor")), .Names = c("key", "classification"), row.names = c(NA, -5L), class = "data.frame")

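How do I use the "long strings" in the Pets table to look up the relevant classification in the key table? The problem is that my lookup strings contain the substrings found in the key table, not the other way around.

I started with grepl like this: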

key[grepl(pets[1,1], key[ , 2]), ]

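But this does not work, because "calico cat" is not in the key list, although "cat" is. The result I am looking for would be "feline". (Note: I cannot simply switch things around, because in my own code this sits inside an apply function that loops over every row in the data. So instead of pets[1,1] it is pets[n,1]. Eventually I plan to cbind the results onto the event-stream data for further analysis.)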
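To make the failure concrete, here is a minimal sketch. It uses two hypothetical vectors, `pet_strings` and `keys`, as stand-ins for the data frames above. `grepl(pattern, x)` tests whether `pattern` occurs inside `x`, so passing the long string as the pattern asks whether a whole sentence occurs inside a short key, which is never true. Using the key as the pattern gives the intended test.

```r
# Hypothetical stand-ins for the Pets and key columns above
pet_strings <- c("one calico cat that's smart", "German Shepard dog")
keys <- c("dog", "cat", "lizard", "bird", "fish")

# Long string as the pattern: does the whole sentence occur inside a key? Never.
wrong_way <- grepl(pet_strings[1], keys, fixed = TRUE)

# Key as the pattern: does each key occur inside the long string?
right_way <- vapply(keys, grepl, logical(1), x = pet_strings[1],
                    ignore.case = TRUE)

keys[wrong_way]  # no key is selected
keys[right_way]  # the key contained in the first pet string
```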

I am having trouble wrapping my head around how to do this. Any suggestions?

Asked by JoeM05, 2017-11-10

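It looks like the key is always the second word of each "long string". Is that a reasonable assumption? – useR

Unfortunately, no. The strings contain anywhere from a few to several different words. All I know is that the "key" word is in there somewhere. – JoeM05

Then you should provide a long string that does not fit this assumption. Also, please provide your dataset by copying and pasting the output of `dput(my_data)` into your question, rather than formatting it the way you currently have. – useR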

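You can do this kind of thing very easily with the fuzzyjoin package.

Here you can use regex_left_join, which works just like a normal left join (such as dplyr::left_join), except that whether two rows match is decided by a regular-expression match, similar to stringr::str_detect.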

library(tibble)
library(fuzzyjoin)

pets <- tribble(
  ~pets,
  "one calico cat that's smart",
  "German Shepard dog",
  "A Chameleon that is a Lizard",
  "a cute tabby cat",
  "the fish guppy",
  "Lizard Gecko",
  "German Shepard dog",
  "Budgie Bird",
  "Canary Bird in a coal mine",
  "a chihuahua dog"
)

key <- tribble(
  ~key,     ~classification,
  "dog",    "canine",
  "cat",    "feline",
  "lizard", "reptile",
  "bird",   "avian",
  "fish",   "fish"
)

# Each key is treated as a regex and matched against the pets column
regex_left_join(pets, key, by = c("pets" = "key"), ignore_case = TRUE)

#> # A tibble: 10 x 3
#>    pets                         key    classification
#>    <chr>                        <chr>  <chr>
#>  1 one calico cat that's smart  cat    feline
#>  2 German Shepard dog           dog    canine
#>  3 A Chameleon that is a Lizard lizard reptile
#>  4 a cute tabby cat             cat    feline
#>  5 the fish guppy               fish   fish
#>  6 Lizard Gecko                 lizard reptile
#>  7 German Shepard dog           dog    canine
#>  8 Budgie Bird                  bird   avian
#>  9 Canary Bird in a coal mine   bird   avian
#> 10 a chihuahua dog              dog    canine
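For readers who want to see what a regex left join does conceptually, here is a rough base R sketch. It is not fuzzyjoin's actual implementation, only an illustration of the idea: test every key against every pet string, keep the matching pairs, and keep unmatched pets with NA. The "a pet rock" row is a made-up non-match added for illustration, and unmatched rows are appended at the end rather than kept in their original position.

```r
pets <- data.frame(
  pets = c("one calico cat that's smart", "German Shepard dog",
           "Lizard Gecko", "a pet rock"),  # last row: made-up non-match
  stringsAsFactors = FALSE
)
key <- data.frame(
  key = c("dog", "cat", "lizard", "bird", "fish"),
  classification = c("canine", "feline", "reptile", "avian", "fish"),
  stringsAsFactors = FALSE
)

# hits[i, j] is TRUE when key j occurs (case-insensitively) in pet string i
hits <- vapply(key$key, grepl, logical(nrow(pets)),
               x = pets$pets, ignore.case = TRUE)

# One output row per (pet, key) match, ordered by pet row
idx <- which(hits, arr.ind = TRUE)
idx <- idx[order(idx[, "row"]), , drop = FALSE]
matched <- data.frame(pets = pets$pets[idx[, "row"]],
                      key[idx[, "col"], ],
                      stringsAsFactors = FALSE)
rownames(matched) <- NULL

# Pets with no matching key are kept with NA, as a left join would do
unmatched <- pets$pets[rowSums(hits) == 0]
if (length(unmatched) > 0) {
  matched <- rbind(matched,
                   data.frame(pets = unmatched, key = NA_character_,
                              classification = NA_character_,
                              stringsAsFactors = FALSE))
}
matched
```

A pet string that contains two keys would produce two rows here, which is also how a join behaves.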

Answered by austensen, 2017-11-10 22:53:29

This works. Handy library, thanks @austensen – JoeM05

You can extract the key from each pet string and then look it up in the table:

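# Alternation of all keys: "dog|cat|lizard|bird|fish"
Pattern = paste(KeyTable$key, collapse="|")
# Wrap it so the whole string is matched and the key is captured in group 1
Pattern = paste0(".*(", Pattern, ").*")
# Replace each string with its captured key, lower-cased to match the table
Type = tolower(sub(Pattern, "\\1", Pets, ignore.case=TRUE))
# Look up the classification for each extracted key
KeyTable$classification[match(Type, KeyTable$key)]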
[1] "feline" "canine" "reptile" "feline" "feline" "canine" "fish" 
[8] "reptile" "canine" "avian" "avian" "canine"
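One caveat with the `sub()` approach: because the leading `.*` is greedy, it captures the last key that appears in a string, and a string containing no key is returned unchanged (so `match()` gives NA for it). Below is a variant sketch using `regexpr()` and `regmatches()`, which takes the first key instead. The names `key_table` and `pet_strings`, and the non-matching "a pet rock" entry, are illustrative assumptions.

```r
key_table <- data.frame(
  key = c("dog", "cat", "lizard", "bird", "fish"),
  classification = c("canine", "feline", "reptile", "avian", "fish"),
  stringsAsFactors = FALSE
)
pet_strings <- c("one calico cat that's smart", "A Chameleon that is a Lizard",
                 "the fish guppy", "a pet rock")  # last entry: made-up non-match

pattern <- paste(key_table$key, collapse = "|")

# Position of the first key in each string (-1 where there is none)
m <- regexpr(pattern, pet_strings, ignore.case = TRUE)

# Fill in the matched key, lower-cased; strings without a key stay NA
type <- rep(NA_character_, length(pet_strings))
type[m > 0] <- tolower(regmatches(pet_strings, m))

result <- key_table$classification[match(type, key_table$key)]
result
```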

Data

KeyTable = read.table(text="key classification 
dog canine 
cat feline 
lizard reptile 
bird avian  
fish fish", 
header=TRUE, stringsAsFactors=FALSE) 

Pets = c("calico cat", 
"Shepard dog" , 
"Chameleon Lizard", 
"calico cat", 
"tabby cat", 
"chihuahua dog", 
"guppy fish", 
"Gecko Lizard", 
"Shepard dog", 
"Budgie Bird", 
"Canary Bird" , 
"chihuahua dog")

Answered by G5W, 2017-11-10 22:54:20

Here is another approach, using hashmap:

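library(hashmap)
library(dplyr)   # %>%, mutate(), select(), bind_cols()
library(tidyr)   # separate_rows()

# Hash table mapping each key to its classification
hash_table = hashmap(Lookup$key, Lookup$classification)

Pets %>%
    separate_rows(Pets, sep = " ") %>%               # one row per word
    mutate(class = hash_table[[tolower(Pets)]]) %>%  # look up each word
    na.omit() %>%                                    # keep only words that are keys
    select(Key = Pets, class) %>%
    bind_cols(Pets, .)                               # put back next to the original strings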

Result:

> hash_table 
## (character) => (character) 
##  [fish] => [fish]  
##  [bird] => [avian]  
## [lizard] => [reptile] 
##  [cat] => [feline] 
##  [dog] => [canine] 

                           Pets    Key   class
1   one calico cat that's smart    cat  feline
2            German Shepard dog    dog  canine
3  A Chameleon that is a Lizard Lizard reptile
4              a cute tabby cat    cat  feline
5                the fish guppy   fish    fish
6                  Lizard Gecko Lizard reptile
7            German Shepard dog    dog  canine
8                   Budgie Bird   Bird   avian
9    Canary Bird in a coal mine   Bird   avian
10              a chihuahua dog    dog  canine
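The same word-by-word lookup can also be sketched without the hashmap package, using a named character vector as the lookup table. This is a base R alternative rather than the method shown above. The names `lookup` and `classify` are hypothetical, and the sketch assumes the keys appear as whole words separated by single spaces.

```r
pets <- c("one calico cat that's smart", "German Shepard dog",
          "A Chameleon that is a Lizard", "the fish guppy")

# Named vector: names are the keys, values are the classifications
lookup <- c(dog = "canine", cat = "feline", lizard = "reptile",
            bird = "avian", fish = "fish")

# Split a string into lower-cased words and return the classification of the
# first word that is a key (NA if no word is a key)
classify <- function(s) {
  words <- tolower(strsplit(s, " ", fixed = TRUE)[[1]])
  hit <- words[words %in% names(lookup)]
  if (length(hit) == 0) NA_character_ else unname(lookup[hit[1]])
}

classes <- vapply(pets, classify, character(1), USE.NAMES = FALSE)
data.frame(Pets = pets, class = classes, stringsAsFactors = FALSE)
```

Because it matches whole words only, this variant would not classify a string such as "catfish", whereas the regex-based answers would.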

Data:

Pets = structure(list(Pets = c("one calico cat that's smart", "German Shepard dog", 
           "A Chameleon that is a Lizard", "a cute tabby cat", "the fish guppy", 
           "Lizard Gecko", "German Shepard dog", "Budgie Bird", "Canary Bird in a coal mine", 
           "a chihuahua dog")), .Names = "Pets", row.names = c(NA, -10L), class = "data.frame") 


Lookup = structure(list(key = c("dog", "cat", "lizard", "bird", "fish"), 
         classification = c("canine", "feline", "reptile", "avian", 
         "fish")), class = "data.frame", .Names = c("key", "classification" 
        ), row.names = c(NA, -5L))

Answered by useR, 2017-11-11 01:59:09
