Searching for multiple elements in a column of a data frame in R

I have a data frame and a list of IDs, and I want to check whether those IDs appear in the data frame. The data frame looks like this:

dput(bed,"mybed.bed") 
sample <- c("13874.p1", "13609.p1","12736.p1", "11970.p1","12025.p1","12189.p1","12529.p1","11522.p1","11716.p1","13684.p1")

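I want to return the rows of the data frame that contain any of the values shared between the sample vector and df$sample_ID.

I tried sapply(samples, grepl, df$sample_ID), but it only checks whether the first element of sample is present. Any help would be greatly appreciated!!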

Source

2017-02-18 Workhorse

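Sorry, I misread the question. Can you provide sample data? – BigDataScientist

It's in the question. – Workhorse

That is hard to read into R; can you provide the output of `dput()`? – BigDataScientist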

I think I have a solution, using str_locate_all from the stringr package. For example:

v <- c("abc11", "abc11abc11", "abc11abc11abc11abc") 
library(stringr) 
result1 <- str_locate_all(v[1], "11") 
result2 <- str_locate_all(v[2], "11") 
result3 <- str_locate_all(v[3], "11")

The output has one row per match, each holding a pair of values: the start and end position of that match:

> result1 
[[1]] 
    start end 
[1,]  4 5 

> result2 
[[1]] 
    start end 
[1,]  4 5 
[2,]  9 10 

> result3 
[[1]] 
    start end 
[1,]  4 5 
[2,]  9 10 
[3,] 14 15 
>

The result is stored in a slightly awkward structure:

> class(result3) 
[1] "list" 
> length(result3) 
[1] 1 
>

Its only element is an integer matrix (on R 4.0 and later, class() reports "matrix" "array" for it):

> class(result3[[1]]) 
[1] "matrix" 
> dim(result3[[1]]) 
[1] 3 2 
>

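The function str_locate gives simpler output, but it only finds the first match.

My suggestion is to extract the first element of the list and work with that, for example: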

m <- result3[[1]]

This makes it easier to access the information stored in result3, which is a 3x2 matrix:

> m 
    start end 
[1,]  4 5 
[2,]  9 10 
[3,] 14 15

Now, to get the number of matches:

> nrow(m) 
[1] 3

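or dim(m)[1].

So with the result stored as a matrix, the information is easier to extract. To get the start positions of all matches for the input string, just extract the first column: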

> m[,1] 
[1] 4 9 14
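
For comparison, the same positions can be obtained without stringr. The following is a minimal sketch using base R's regexpr (first match only) and gregexpr (all matches), with fixed = TRUE so that the pattern is treated as a literal string. The variable names are illustrative and not part of the answer above.

```r
v <- c("abc11", "abc11abc11", "abc11abc11abc11abc")

# regexpr() reports only the first match in each string
first_start <- as.integer(regexpr("11", v, fixed = TRUE))

# gregexpr() reports every match; it returns one list element per input string
all_matches <- gregexpr("11", v[3], fixed = TRUE)[[1]]
starts <- as.integer(all_matches)
ends <- starts + attr(all_matches, "match.length") - 1L

# gregexpr() signals "no match" with a single -1
no_match <- as.integer(gregexpr("zz", v[3], fixed = TRUE)[[1]])
```

Here starts and ends play the role of the start and end columns of the stringr matrix, and a value of -1 replaces the zero-row matrix as the no-match marker.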

--------------------------------------------------------------------------------

Edit

Applying the previous concepts to the original question, i.e. finding matches for a vector of m patterns in a vector of n values.

--------------------------------------------------------------------------------

Going back to what I understand your question to be, let's say we have the following data frame:

df = data.frame(ID = c(1,2,3,4), 
sample_ID = c(
    "12613.p1", 
    "12613.p1", 
    "11401.p1,11120.p1,11199.p1,11226.p1,11395.p1,11296.p1,11333.p1,11374.p1,11388.p1,11395.p1,11420.p1", 
    "11401.p1,13863.p1"), 
stringsAsFactors = F)

Now, we have the following sample vector:

sample <- c("11120.p1", "11395.p1", "12613.p1", "13863.p1", "11401.p1")

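df has 4 rows and the sample vector has 5 elements. Following the earlier explanation, to search in df$sample_ID we can use the sapply function to locate the elements of sample: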

library(stringr) 
all <- sapply(df$sample_ID, FUN = function(x) {return(str_locate_all(x, sample))})

Now the output will be:

> class(all) 
[1] "matrix"

where

> dim(all) 
[1] 5 4

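So the result has 5 rows, one for each element of sample, and 4 columns, one for each row of df.

We expect the following number of matches for each element of sample: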

Sample   | df$sample_ID[1] | df$sample_ID[2] | df$sample_ID[3] | df$sample_ID[4]
-------- | --------------- | --------------- | --------------- | ---------------
11120.p1 | 0               | 0               | 1               | 0
11395.p1 | 0               | 0               | 2               | 0
12613.p1 | 1               | 1               | 0               | 0
13863.p1 | 0               | 0               | 0               | 1
11401.p1 | 0               | 0               | 1               | 1
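
The expected counts can also be computed rather than tallied by hand. The sketch below uses only base R; count_fixed is a hypothetical helper introduced here for illustration, and fixed = TRUE is an assumption that the IDs should be matched as literal strings rather than as regular expressions.

```r
df <- data.frame(
  ID = 1:4,
  sample_ID = c(
    "12613.p1",
    "12613.p1",
    "11401.p1,11120.p1,11199.p1,11226.p1,11395.p1,11296.p1,11333.p1,11374.p1,11388.p1,11395.p1,11420.p1",
    "11401.p1,13863.p1"),
  stringsAsFactors = FALSE)
sample <- c("11120.p1", "11395.p1", "12613.p1", "13863.p1", "11401.p1")

# Hypothetical helper: number of literal occurrences of `pattern` in one string
count_fixed <- function(pattern, text) {
  pos <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  if (pos[1] == -1L) 0L else length(pos)
}

# One column per row of df, one row per element of sample
counts <- sapply(df$sample_ID, function(x) {
  vapply(sample, count_fixed, integer(1), text = x)
})
colnames(counts) <- paste0("row", seq_len(nrow(df)))
counts
```

The resulting integer matrix has the same 5 x 4 layout as the table, so it can be compared cell by cell with the str_locate_all result.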

This is the result obtained:

> all 
    12613.p1 12613.p1 
[1,] Integer,0 Integer,0 
[2,] Integer,0 Integer,0 
[3,] Integer,2 Integer,2 
[4,] Integer,0 Integer,0 
[5,] Integer,0 Integer,0 
    11401.p1,11120.p1,11199.p1,11226.p1,11395.p1,11296.p1,11333.p1,11374.p1,11388.p1,11395.p1,11420.p1 
[1,] Integer,2                       
[2,] Integer,4                       
[3,] Integer,0                       
[4,] Integer,0                       
[5,] Integer,2                       
    11401.p1,13863.p1 
[1,] Integer,0   
[2,] Integer,0   
[3,] Integer,0   
[4,] Integer,2   
[5,] Integer,2   
>

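Each element of the matrix is a list. Here is how to read the result: for each [row, col] the printout gives summary information about the list element, where Integer,n indicates the number of values held in that cell. Each match contributes two values, [start, end], so m matches give m x 2 values. That is why the cell [row, col] = [2,3] shows the value 4: it holds two matches.

To extract the information, say for the matches of sample[2] = 11395.p1 in the third row (df$sample_ID[3]):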

> all[2,3] 
$`11401.p1,11120.p1,11199.p1,11226.p1,11395.p1,11296.p1,11333.p1,11374.p1,11388.p1,11395.p1,11420.p1` 
    start end 
[1,] 37 44 
[2,] 82 89

To extract the start positions of all matches:

> all[2,3][[1]][,1] 
[1] 37 82

For example, with m <- all[2,3][[1]]:

> m[,1] 
[1] 37 82

How do we identify the no-match case?

Let's pick an element of the original matrix where there is no match, [1,1]. Then:

> m <- all[1,1][[1]] 
> dim(m) 
[1] 0 2 
> dim(m)[1] 
[1] 0 
>

I hope this now solves your specific problem.

Source

2017-02-18 19:06:43

Calling:

unique(do.call(c, sapply(X = sample, FUN = function(x){return(grep(pattern = x,x = df$sample_id))})))

should work:

> df = data.frame(chrom = c(1,2,1,1), 
+     sample_id = c("12613.p1", "12613.p1","11118.p1,11120.p1,11199.p1,11226.p1,11285.p1,11296.p1,11333.p1,11374.p1,11388.p1,11395.p1,11420.p1", "11401.p1,13863.p1"), 
+     stringsAsFactors = F) 
> 
> 
> 
> sample <- c("13874.p1", "13609.p1","12736.p1", "11970.p1","12025.p1", 
+    "12189.p1","12529.p1","11522.p1","11716.p1","13684.p1") 
> 
> 
> unique(do.call(c, sapply(X = sample, FUN = function(x) {return(grep(pattern = x,x = df$sample_id))}))) 
integer(0)

No matches.

But if I add one more string to sample:

> sample <- c("13874.p1", "13609.p1","12736.p1", "11970.p1","12025.p1", 
+    "12189.p1","12529.p1","11522.p1","11716.p1","13684.p1", 
+    "11199.p1") 
> 
> 
> unique(do.call(c, sapply(X = sample, FUN = function(x){return(grep(pattern = x,x = df$sample_id))}))) 
[1] 3

It works!
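
One caveat with grep used this way: the pattern is interpreted as a regular expression, so the "." in each ID matches any character. The small sketch below uses a made-up near-miss string (not from the question's data) to show the difference that fixed = TRUE makes.

```r
ids <- c("11199.p1", "11199xp1")  # the second value is a made-up near miss

# As a regular expression, "." matches any single character
regex_hits <- grepl("11199.p1", ids)

# With fixed = TRUE the pattern is matched literally
literal_hits <- grepl("11199.p1", ids, fixed = TRUE)
```

With the regular-expression form both strings match, while the literal form matches only the first. Passing fixed = TRUE to grep in the call above should give the same protection.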

Source

2017-02-18 19:08:37 Ooona

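I think I found a simple solution to this problem (apologies for not posting more realistic data; my dataset is huge).

So I have a character vector of IDs, sample. Then I have a table in which one column contains a list of IDs for each row.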

hits <- c() 
for(i in sample){ 
     hits <- append(hits, which(grepl(i, df$sample_ID, fixed = TRUE))) 
} 

hits2 <- unique(hits)

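I just go through the sample vector, and for each ID I check whether it is present in each df$sample_ID list. This returns the row numbers (from the data frame) of every positive hit. Since some rows may have 2 matches, I remove the duplicates.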
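
The same loop can be written without growing a vector. This is a sketch of one possible vectorised form, using a small made-up data frame in place of the real data; hit_matrix and hits_vec are illustrative names.

```r
# Made-up stand-in for the real (much larger) data
df <- data.frame(
  sample_ID = c("12613.p1", "12613.p1", "11401.p1,11120.p1,11395.p1",
                "11401.p1,13863.p1", "99999.p1"),
  stringsAsFactors = FALSE)
sample <- c("11120.p1", "11395.p1", "13863.p1")

# One column per ID in sample, one row per row of df
hit_matrix <- sapply(sample, grepl, x = df$sample_ID, fixed = TRUE)

# Row numbers with at least one hit; duplicates cannot occur here
hits_vec <- which(rowSums(hit_matrix) > 0)
```

Unlike the loop, this returns the row numbers in ascending order rather than in order of discovery, which usually does not matter for subsetting.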

I can then subset based on those rows.

df2 <- df[hits2,]
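
Because grepl looks for substrings, an ID can also match inside a longer ID. If whole-ID matching is needed, one option is to split each cell and compare the pieces exactly. The sketch below assumes the IDs are separated by commas with no surrounding spaces, and it uses made-up data to show the difference.

```r
# Made-up data: "111120.p1" contains "11120.p1" as a substring
df <- data.frame(
  sample_ID = c("12613.p1", "111120.p1,11199.p1", "11401.p1,13863.p1"),
  stringsAsFactors = FALSE)
sample <- c("11120.p1", "13863.p1")

# Split each cell into its individual IDs
id_lists <- strsplit(df$sample_ID, ",", fixed = TRUE)

# TRUE for rows in which at least one whole ID appears in sample
keep <- vapply(id_lists, function(ids) any(ids %in% sample), logical(1))

df2 <- df[keep, , drop = FALSE]
```

Here the second row is excluded even though a substring search for "11120.p1" would have flagged it.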

Source

2017-02-19 01:49:09 Workhorse
