Fastest R implementation of within

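I use "%within[]%" <- function(x,y){x>=y[1] & x<=y[2]} (meaning x is in the compact set y) in a lot of my R code, but I am fairly sure it is very slow. Do you have anything faster? It needs to work for everything for which > is defined.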

EDIT: x can be a vector, and y is a 2-element vector in ascending order...

EDIT2: It is strange that nobody (as far as I know) has written a package rOperator implementing fast C operators such as %w/i[]%, %w/i[[%, ...

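EDIT3: I realize my question was too general, because the assumptions made about x and y change any result. I think we should close it. Thanks for your input.

Source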

2013-02-21 statquant

Do you mean that 'y' is always a two-element vector, 'y [1] – juba 2013-02-21 12:53:14

'function(...) {return(...)}' => 'function(...) ...' – 2013-02-21 12:56:57

You probably want '&' rather than '&&'. – Roland 2013-02-21 12:59:53

"%within[]%" <- function(x,y){x>=y[1] & x<=y[2]} 

x <- 1:10 
y <- c(3,5) 

x %within[]% y 
"%within[]2%" <- function(x,y) findInterval(x,y,rightmost.closed=TRUE)==1 
x %within[]2% y 

library(microbenchmark) 

microbenchmark(x %within[]% y,x %within[]2% y) 

Unit: microseconds 
      expr min lq median uq max 
1 x %within[]% y 1.849 2.465 2.6185 2.773 11.395 
2 x %within[]2% y 4.928 5.544 5.8520 6.160 37.265 

x <- 1:1e6 
microbenchmark(x %within[]% y,x %within[]2% y) 

Unit: milliseconds 
      expr  min  lq median  uq  max 
1 x %within[]% y 27.81535 29.60647 31.25193 56.68517 88.16961 
2 x %within[]2% y 20.75496 23.07100 24.37369 43.15691 69.62122

This might be a job for Rcpp.

Source

2013-02-21 13:09:44 Roland

+1 TIL how to use 'microbenchmark' properly. – juba 2013-02-21 13:21:12

Well, I don't know whether this counts as slow or not, but here is a small benchmark:

R> within <- function(x,y){return(x>=y[1] & x<=y[2])} 
R> microbenchmark(within(2,c(1,5))) 
Unit: microseconds 
       expr min  lq median uq max neval 
within(2, c(1, 5)) 2.667 2.8305 2.9045 2.969 15.818 100 

R> within2 <- function(x,y) x>=y[1] & x<=y[2] 
R> microbenchmark(within2(2,c(1,5))) 
Unit: microseconds 
       expr min  lq median uq max neval 
within2(2, c(1, 5)) 2.266 2.3205 2.398 2.483 12.472 100 

R> microbenchmark(2>=1 & 2<=5) 
Unit: nanoseconds 
      expr min lq median uq max neval 
2 >= 1 & 2 <= 5 781 821.5 850 911 5701 100

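So it seems that omitting return, as Konrad Rudolph suggested, speeds things up a little. But not writing a function at all is much faster.

Source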

2013-02-21 13:02:25 juba

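My suggestion was actually meant less as a performance improvement than as a matter of style (the 'return' function call is simply redundant here). But I am not surprised by these results. – 2013-02-21 13:05:51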

&& only compares the first elements of the vectors: 1:4 < 4 && 1:4 > 2 gives FALSE and not (FALSE, FALSE, TRUE, FALSE) – 2013-02-21 13:08:49

@JanvanderLaan Yes, I know, but if you don't need a vectorized operation, '&&' is a bit faster. – juba 2013-02-21 13:10:39
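
To make the '&' versus '&&' point from the comments above concrete, here is a minimal base-R sketch (the variable x is only illustrative). It applies '&&' to length-one operands only, since recent R versions (4.3.0 and later) raise an error when '&&' is given longer vectors instead of silently using the first element.

```r
x <- 1:4

# `&` is vectorised: it returns one result per element of x
vectorised <- x < 4 & x > 2
print(vectorised)   # FALSE FALSE  TRUE FALSE

# `&&` works on single values and returns a single TRUE/FALSE,
# which is why it is only suitable when x is a scalar
scalar <- x[3] < 4 && x[3] > 2
print(scalar)       # TRUE
```

So for the operator in the question, where x may be a vector, '&' is the one that gives an element-wise answer.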

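If x contains many values, a tree-based structure can give better performance. If you can restrict your requirements to numeric values, there are 2 options:

An implementation of integer interval trees can be found in the Bioconductor package IRanges.

By default, RSQLite compiles its embedded SQLite library with rtrees enabled. This can be used for any numeric values.

Source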

2013-02-21 13:06:33 lgautier

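I realize this is more complicated than I thought. Ideally it should work with numerics (so everything like Date, POSIXct, ...), but also with characters (in lexicographic order). – statquant 2013-02-21 13:10:09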

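Converting dates to integers (milliseconds since the epoch) is trivial (your pseudonym suggests that you are not working with historical or prehistoric dates). Strings are nothing special, but handling them requires you to make design decisions (prefix matching? suffix matching? strictly the same length?) – lgautier 2013-02-21 13:18:23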

Yes, though I am not sure a conversion is needed, because POSIXct and Date are stored internally as double (oddly so for Date). Understood for strings... – statquant 2013-02-21 13:25:53
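
As a small illustration of the point about Date being stored as double, the sketch below applies the operator from the question directly to dates, with no conversion. The sample dates and strings are made up for the example, and the character comparison depends on the collation order of the current locale, so no result is claimed for it.

```r
# The operator from the question, defined with backticks
`%within[]%` <- function(x, y) x >= y[1] & x <= y[2]

# Dates built from strings are doubles internally, so >= and <= work directly
d <- as.Date(c("2013-02-20", "2013-02-21", "2013-03-01"))
storage <- typeof(d)
print(storage)      # "double"

in_range <- d %within[]% as.Date(c("2013-02-21", "2013-02-28"))
print(in_range)     # FALSE  TRUE FALSE

# Characters are compared using the locale's collation order
print(c("apple", "kiwi", "zebra") %within[]% c("b", "m"))
```

Since the comparison operators are already defined for these classes, the plain R version covers them for free, whereas a C or C++ version would need a separate code path for each type.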

You can get a small performance improvement with a simple Rcpp implementation:

library(Rcpp) 
library(microbenchmark) 

withinR <- function(x,y) x >= y[1] & x <= y[2] 
cppFunction("LogicalVector withinCpp(const NumericVector& x, const NumericVector& y) {
    // y holds the lower and upper bounds of the closed interval
    double min = y[0], max = y[1];

    int n = x.size();
    LogicalVector out(n);

    for (int i = 0; i < n; ++i) {
        double val = x[i];
        if (NumericVector::is_na(val)) {
            // propagate missing values, as the R version does
            out[i] = NA_LOGICAL;
        } else {
            out[i] = val >= min && val <= max;
        }
    }
    return out;
}")

x <- sample(100, 1e5, replace = TRUE)

stopifnot(all.equal(withinR(x, c(25, 50)), withinCpp(x, c(25, 50)))) 

microbenchmark(
    withinR(x, c(25, 50)), 
    withinCpp(x, c(25, 50)) 
)

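The C++ version is about 4 times faster on my computer. You could probably tune it further with more Rcpp tricks, but this already looks pretty fast. Even the R version would need to be called very frequently before it is likely to become a bottleneck.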

# Unit: microseconds 
#      expr min lq median uq max 
# 1 withinCpp(x, c(25, 50)) 635 659 678 1012 27385 
# 2 withinR(x, c(25, 50)) 1969 2031 2573 2954 4082

Source

2013-02-21 14:50:26 hadley
