再採樣中的R

假設我們有以下數據再採樣中的R

set.seed(123) 
dat <- data.frame(var1=c(10,35,13,19,15,20,19), id=c(1,1,2,2,2,3,4)) 
(sampledIDs <- sample(min(dat$id):max(dat$id), size=3, replace=TRUE)) 
> [1] 2 4 2

的sampledIDs是從dat$id採樣（與替換）的ID的向量。我需要導致（和作品也爲大量的數據可能有更多的變量）的代碼：

代碼dat[which(dat$id%in%sampledIDs),]不給我我想要的東西，因爲這個代碼的結果是

其中dat$id==2的主題在此數據中只出現一次（我理解爲什麼這是結果，但不知道如何得到我想要的）。有人可以幫忙嗎？

編輯：謝謝你的答案，在這裏所有的答案的運行時間（對於那些有興趣誰）：

                test replications elapsed relative user.self 
3 dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]   1000 0.67 1.000  0.64 
1 dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[, 1], ]   1000 0.67 1.000  0.67 
2  do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])   1000 1.83 2.731  1.83 
4        setkey(setDT(dat), id)[J(sampledIDs)]   1000 1.33 1.985  1.33

來源

2014-11-03 Giuseppe

+1用於提供答案分析以及明確說明的問題。 – 2014-11-03 12:19:56

數據大小是多少？你提到你有一個大數據 – 2014-11-03 12:21:56

不是很大的數據，但比例子中有更多的觀察/變量：''data.frame'：\t 4454 obs。的15個變量'。 – Giuseppe 2014-11-03 12:28:19

這將可能是一個大數據的最快方法設置使用data.tablebinary search

library(data.table) 
setkey(setDT(dat), id)[J(sampledIDs)] 
# var1 id 
# 1: 13 2 
# 2: 19 2 
# 3: 15 2 
# 4: 19 4 
# 5: 13 2 
# 6: 19 2 
# 7: 15 2

編輯：下面是一個不標杆如此大的數據集（1e + 05行）說明哪個是明顯的贏家

library(data.table) 
library(microbenchmark) 

set.seed(123) 
n <- 1e5 
dat <- data.frame(var1 = sample(seq_len(100), n, replace = TRUE), id = sample(seq_len(10), n, replace = TRUE)) 
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 3, replace = TRUE)) 
dat2 <- copy(dat) 

Sven1 <- function(dat) dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ] 
Sven2 <- function(dat) dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ] 
flodel <- function(dat) do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)]) 
David <- function(dat2) setkey(setDT(dat2), id)[J(sampledIDs)] 

Res <- microbenchmark(Sven1(dat), 
         Sven2(dat), 
         flodel(dat), 
         David(dat2)) 
Res 
# Unit: milliseconds 
#  expr  min  lq median  uq  max neval 
# Sven1(dat) 4.356151 4.817557 6.715533 7.313877 45.407768 100 
# Sven2(dat) 9.750984 12.385677 14.324671 16.655005 54.797096 100 
# flodel(dat) 36.097602 39.680006 42.236017 44.314981 82.261879 100 
# David(dat2) 1.813387 2.068749 2.154774 2.335442 8.665379 100 

boxplot(Res)

enter image description here

如果，例如，我們想品嚐更多的則僅有3 ID，但讓說，10，基準變得可笑

(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 10, replace = TRUE)) 
[1] 7 6 10 9 5 9 5 3 7 3 
# Unit: milliseconds 
#  expr  min   lq  median   uq  max neval 
# Sven1(dat) 80.124502 89.141162 97.908365 104.111738 175.40919 100 
# Sven2(dat) 99.010410 127.797966 159.404395 170.751069 209.96887 100 
# flodel(dat) 129.722435 144.847505 157.737362 178.242103 232.41293 100 
# David(dat2) 2.431682 2.721038 2.855103 3.057796 19.60826 100

enter image description here

來源

2014-11-03 11:49:22

你可以這樣做：

do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])

來源

2014-11-03 11:28:26 flodel

一方法：

dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ] 
#  var1 id 
# 3  13 2 
# 4  19 2 
# 5  15 2 
# 7  19 4 
# 3.1 13 2 
# 4.1 19 2 
# 5.1 15 2

的另一種方法：

dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]

來源

2014-11-03 11:30:39

再採樣中的R

回答

相關問題