2017-06-06

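Forward-period correlations in R

I'm not sure whether this meets the standard for a question here, but I need help making my code more efficient. I think it can be done more efficiently; I'm just bad at writing functions, and seeing an answer might help me improve.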

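Example: I have time series data and want to compute how changes in an indicator Y correlate with the forward-period changes of my X values (there are several X series). The dput is at the end.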

My solution:

str(data.dt) 
Classes ‘data.table’ and 'data.frame': 210 obs. of 3 variables: 
$ id  : chr "X1" "X1" "X1" "X1" ... 
$ date : Date, format: "2016-11-18" "2016-11-25" "2016-12-02" "2016-12-09" ... 
$ PX_LAST: num 2.72 2.76 2.86 2.81 2.83 ... 
- attr(*, ".internal.selfref")=<externalptr> 

#separate indicator value 
y.dt <- data.dt[id=="Y"] 

#add indicator as own column for each X 
step1.dt <- y.dt[data.dt, on="date"] 
#rename 
correl.dt <- step1.dt[, .(date=date, x_id=i.id, x_value=i.PX_LAST, y_id = id, y_value=PX_LAST)] 
#discard NAs and Y from x_id 
correl.dt <- na.omit(correl.dt[x_id != "Y"]) 
#calculate change for each X 
correl.dt[, x.chg := c(rep(NA, 1), diff(x_value, 1)), by=list(x_id)] 
#create forward change by leading changes 
correl.dt[, fwd.xchg := shift(x.chg, type='lead', 1), by = list(x_id)] 

#create multiple Y changes to test correlations 
correl.dt[, y.chg1 := c(rep(NA, 1), diff(y_value, 1)), by=list(x_id)] 
correl.dt[, y.chg2 := c(rep(NA, 2), diff(y_value, 2)), by=list(x_id)] 
correl.dt[, y.chg3 := c(rep(NA, 3), diff(y_value, 3)), by=list(x_id)] 
correl.dt[, y.chg4 := c(rep(NA, 4), diff(y_value, 4)), by=list(x_id)] 
correl.dt[, y.chg5 := c(rep(NA, 5), diff(y_value, 5)), by=list(x_id)] 
correl.dt[, y.chg6 := c(rep(NA, 6), diff(y_value, 6)), by=list(x_id)] 

#cbind results together 
cbind(correl.dt[, cor(fwd.xchg, y.chg1, method='spearman', use='pairwise'), by=.(x_id)], 
     correl.dt[, cor(fwd.xchg, y.chg2, method='spearman', use='pairwise'), by=.(x_id)][,2], 
     correl.dt[, cor(fwd.xchg, y.chg3, method='spearman', use='pairwise'), by=.(x_id)][,2], 
     correl.dt[, cor(fwd.xchg, y.chg4, method='spearman', use='pairwise'), by=.(x_id)][,2], 
     correl.dt[, cor(fwd.xchg, y.chg5, method='spearman', use='pairwise'), by=.(x_id)][,2], 
     correl.dt[, cor(fwd.xchg, y.chg6, method='spearman', use='pairwise'), by=.(x_id)][,2]) 

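The results are not meaningful because I only have a very small subset. I also chose short correlation horizons to fit that subset. Help is appreciated: what is the best way to test forward correlations? I have fallen in love with data.table; I'm not very good at it yet, but I'm improving. I have about 100-200 indicators to test.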

Here is the dput:

structure(list(id = c("X1", "X1", "X1", "X1", "X1", "X1", "X1", 
"X1", "X1", "X1", "X1", "X1", "X1", "X1", "X1", "X1", "X1", "X1", 
"X1", "X1", "X1", "X1", "X1", "X1", "X1", "X1", "X1", "X1", "X1", 
"X1", "X2", "X2", "X2", "X2", "X2", "X2", "X2", "X2", "X2", "X2", 
"X2", "X2", "X2", "X2", "X2", "X2", "X2", "X2", "X2", "X2", "X2", 
"X2", "X2", "X2", "X2", "X2", "X2", "X2", "X2", "X2", "X3", "X3", 
"X3", "X3", "X3", "X3", "X3", "X3", "X3", "X3", "X3", "X3", "X3", 
"X3", "X3", "X3", "X3", "X3", "X3", "X3", "X3", "X3", "X3", "X3", 
"X3", "X3", "X3", "X3", "X3", "X3", "X4", "X4", "X4", "X4", "X4", 
"X4", "X4", "X4", "X4", "X4", "X4", "X4", "X4", "X4", "X4", "X4", 
"X4", "X4", "X4", "X4", "X4", "X4", "X4", "X4", "X4", "X4", "X4", 
"X4", "X4", "X4", "X5", "X5", "X5", "X5", "X5", "X5", "X5", "X5", 
"X5", "X5", "X5", "X5", "X5", "X5", "X5", "X5", "X5", "X5", "X5", 
"X5", "X5", "X5", "X5", "X5", "X5", "X5", "X5", "X5", "X5", "X5", 
"X6", "X6", "X6", "X6", "X6", "X6", "X6", "X6", "X6", "X6", "X6", 
"X6", "X6", "X6", "X6", "X6", "X6", "X6", "X6", "X6", "X6", "X6", 
"X6", "X6", "X6", "X6", "X6", "X6", "X6", "X6", "Y", "Y", "Y", 
"Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", 
"Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", 
"Y"), date = structure(c(17123L, 17130L, 17137L, 17144L, 17151L, 
17158L, 17165L, 17172L, 17179L, 17186L, 17193L, 17200L, 17207L, 
17214L, 17221L, 17228L, 17235L, 17242L, 17249L, 17256L, 17263L, 
17270L, 17277L, 17284L, 17291L, 17298L, 17305L, 17312L, 17319L, 
17326L, 17123L, 17130L, 17137L, 17144L, 17151L, 17158L, 17165L, 
17172L, 17179L, 17186L, 17193L, 17200L, 17207L, 17214L, 17221L, 
17228L, 17235L, 17242L, 17249L, 17256L, 17263L, 17270L, 17277L, 
17284L, 17291L, 17298L, 17305L, 17312L, 17319L, 17326L, 17123L, 
17130L, 17137L, 17144L, 17151L, 17158L, 17165L, 17172L, 17179L, 
17186L, 17193L, 17200L, 17207L, 17214L, 17221L, 17228L, 17235L, 
17242L, 17249L, 17256L, 17263L, 17270L, 17277L, 17284L, 17291L, 
17298L, 17305L, 17312L, 17319L, 17326L, 17123L, 17130L, 17137L, 
17144L, 17151L, 17158L, 17165L, 17172L, 17179L, 17186L, 17193L, 
17200L, 17207L, 17214L, 17221L, 17228L, 17235L, 17242L, 17249L, 
17256L, 17263L, 17270L, 17277L, 17284L, 17291L, 17298L, 17305L, 
17312L, 17319L, 17326L, 17123L, 17130L, 17137L, 17144L, 17151L, 
17158L, 17165L, 17172L, 17179L, 17186L, 17193L, 17200L, 17207L, 
17214L, 17221L, 17228L, 17235L, 17242L, 17249L, 17256L, 17263L, 
17270L, 17277L, 17284L, 17291L, 17298L, 17305L, 17312L, 17319L, 
17326L, 17123L, 17130L, 17137L, 17144L, 17151L, 17158L, 17165L, 
17172L, 17179L, 17186L, 17193L, 17200L, 17207L, 17214L, 17221L, 
17228L, 17235L, 17242L, 17249L, 17256L, 17263L, 17270L, 17277L, 
17284L, 17291L, 17298L, 17305L, 17312L, 17319L, 17326L, 17123L, 
17130L, 17137L, 17144L, 17151L, 17158L, 17165L, 17172L, 17179L, 
17186L, 17193L, 17200L, 17207L, 17214L, 17221L, 17228L, 17235L, 
17242L, 17249L, 17256L, 17263L, 17270L, 17277L, 17284L, 17291L, 
17298L, 17305L, 17312L, 17319L, 17326L), class = "Date"), PX_LAST = c(2.719, 
2.761, 2.863, 2.815, 2.831, 2.872, 2.765, 2.681, 2.692, 2.783, 
2.779, 2.795, 2.696, 2.803, 2.73, 2.807, 2.977, 2.861, 2.75, 
2.701, 2.551, 2.474, 2.538, 2.575, 2.648, 2.635, 2.475, 2.41, 
2.412, 2.373, 1.579, 1.56, 1.619, 1.73, 1.833, 1.796, 1.721, 
1.731, 1.715, 1.751, 1.782, 1.766, 1.697, 1.711, 1.607, 1.702, 
1.811, 1.761, 1.642, 1.625, 1.596, 1.494, 1.47, 1.547, 1.542, 
1.571, 1.475, 1.445, 1.4, 1.413, 1.455, 1.417, 1.38, 1.453, 1.438, 
1.345, 1.239, 1.383, 1.364, 1.431, 1.471, 1.352, 1.256, 1.211, 
1.078, 1.185, 1.231, 1.244, 1.196, 1.139, 1.075, 1.043, 1.034, 
1.085, 1.117, 1.086, 1.093, 1.012, 1.038, 1.02, 0.272, 0.24, 
0.281, 0.365, 0.314, 0.221, 0.208, 0.298, 0.338, 0.421, 0.462, 
0.412, 0.32, 0.302, 0.186, 0.356, 0.485, 0.435, 0.403, 0.328, 
0.228, 0.187, 0.253, 0.317, 0.418, 0.391, 0.368, 0.331, 0.274, 
0.268, 2.3548, 2.3572, 2.3831, 2.4675, 2.5916, 2.5373, 2.4443, 
2.4193, 2.3964, 2.4668, 2.4843, 2.4648, 2.4073, 2.4147, 2.3117, 
2.478, 2.5745, 2.5005, 2.4123, 2.3874, 2.3822, 2.2374, 2.248, 
2.2802, 2.3487, 2.3257, 2.2346, 2.2465, 2.1591, 2.1538, 0.517, 
0.534, 0.559, 0.611, 0.64, 0.615, 0.556, 0.628, 0.628, 0.699, 
0.749, 0.71, 0.665, 0.678, 0.549, 0.694, 0.774, 0.75, 0.673, 
0.605, 0.548, 0.516, 0.564, 0.587, 0.653, 0.572, 0.518, 0.514, 
0.425, 0.43, 0.8906, 0.895, 0.8999, 0.9062, 0.89, 0.8864, 0.8802, 
0.8839, 0.8964, 0.899, 0.9145, 0.9039, 0.9054, 0.9044, 0.8934, 
0.8978, 0.9041, 0.9048, 0.8979, 0.9023, 0.892, 0.8842, 0.8942, 
0.9107, 0.9121, 0.9163, 0.8944, 0.8965, 0.8995, 0.8965)), row.names = c(NA, 
-210L), class = c("data.table", "data.frame"), .Names = c("id", 
"date", "PX_LAST"), .internal.selfref = <pointer: 0x003c24a0>) 

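Before I try anything silly, a side question for the data.table connoisseurs: is there any risk in using this 'dput' with an explicit pointer in it? Like overwriting something in memory? –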
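On the pointer question, a minimal sketch rather than an authoritative answer: the `<pointer: ...>` token is not valid R syntax, so the dput text fails to parse before anything could touch memory. A common workaround (my assumption about how one would load it, not something the poster confirmed) is to delete the `.internal.selfref` entry and call `setDT()` afterwards. The three-row table below is a shortened stand-in for the full dput.

```r
library(data.table)

# The raw dput text does not parse because of the <pointer: ...> token
raw_txt  <- "structure(list(a = 1), .internal.selfref = <pointer: 0x003c24a0>)"
parse_ok <- !inherits(try(parse(text = raw_txt), silent = TRUE), "try-error")
parse_ok

# Same kind of structure with the .internal.selfref entry removed
# (first three rows of the question's data only)
data.dt <- structure(list(
  id      = c("X1", "X1", "X1"),
  date    = structure(c(17123L, 17130L, 17137L), class = "Date"),
  PX_LAST = c(2.719, 2.761, 2.863)),
  row.names = c(NA, -3L), class = c("data.table", "data.frame"))

# setDT() rebuilds the internal self reference so that := can be used normally
setDT(data.dt)
data.dt[, chg := PX_LAST - shift(PX_LAST)]
data.dt
```

With the pointer entry removed, the rest of the dput in the question can be pasted in the same way.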

Answers


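Here is what I came up with. It is a bit faster because fewer variables are assigned, but the performance gain is not that large. The simplicity of the code is probably the biggest advantage.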

library(dplyr) 
lapply(as.list(1:6), 
    function(x) {correl.dt[, cor(fwd.xchg, y_value - shift(y_value, x), 
            method='spearman', use='pairwise'), by=.(x_id)][, 2]}) %>% 
do.call(cbind, .) 
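To see why the intermediate `y.chg1` ... `y.chg6` columns can be skipped, here is a small check that `y - shift(y, k)` produces the same values as `c(rep(NA, k), diff(y, k))`. The vector `y` is made up for illustration and is not taken from the question's data.

```r
library(data.table)

y <- c(0.89, 0.90, 0.91, 0.89, 0.88, 0.90, 0.92)  # made-up values

# Compare the two ways of building a k-period change for k = 1, 2, 3
same <- sapply(1:3, function(k) {
  via_diff  <- c(rep(NA, k), diff(y, k))  # padded diff(), as in the question
  via_shift <- y - shift(y, k)            # shift() difference, as in the lapply() call
  isTRUE(all.equal(via_diff, via_shift))
})
same
```

Both forms put `NA` in the first `k` positions, so `cor(..., use='pairwise')` drops the same rows either way.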

Here is a benchmark:

my_code <- function(){ 
    lapply(as.list(1:6), 
     function(x) {correl.dt[, cor(fwd.xchg, y_value - shift(y_value, x), 
             method='spearman', use='pairwise'), by=.(x_id)][, 2]}) %>% 
    do.call(cbind, .) 

} 

your_code <- function(){ 
    #create multiple Y changes to test correlations 
    correl.dt[, y.chg1 := c(rep(NA, 1), diff(y_value, 1)), by=list(x_id)] 
    correl.dt[, y.chg2 := c(rep(NA, 2), diff(y_value, 2)), by=list(x_id)] 
    correl.dt[, y.chg3 := c(rep(NA, 3), diff(y_value, 3)), by=list(x_id)] 
    correl.dt[, y.chg4 := c(rep(NA, 4), diff(y_value, 4)), by=list(x_id)] 
    correl.dt[, y.chg5 := c(rep(NA, 5), diff(y_value, 5)), by=list(x_id)] 
    correl.dt[, y.chg6 := c(rep(NA, 6), diff(y_value, 6)), by=list(x_id)] 

    #cbind results together 
    cbind(correl.dt[, cor(fwd.xchg, y.chg1, method='spearman', use='pairwise'), by=.(x_id)], 
     correl.dt[, cor(fwd.xchg, y.chg2, method='spearman', use='pairwise'), by=.(x_id)][,2], 
     correl.dt[, cor(fwd.xchg, y.chg3, method='spearman', use='pairwise'), by=.(x_id)][,2], 
     correl.dt[, cor(fwd.xchg, y.chg4, method='spearman', use='pairwise'), by=.(x_id)][,2], 
     correl.dt[, cor(fwd.xchg, y.chg5, method='spearman', use='pairwise'), by=.(x_id)][,2], 
     correl.dt[, cor(fwd.xchg, y.chg6, method='spearman', use='pairwise'), by=.(x_id)][,2]) 
} 

microbenchmark::microbenchmark(my_code(), your_code()) 
## Unit: milliseconds 
##  expr  min  lq  mean median  uq  max neval 
## my_code() 8.818589 9.160749 9.47846 9.293391 9.605331 13.12389 100 
## your_code() 11.068776 11.436789 11.98425 11.600102 11.926878 16.94066 100 
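A possible variation, sketched under assumptions and not benchmarked here: instead of one grouped call per lag followed by `cbind`, a single grouped call can return every lag as a named column. `toy.dt` is a randomly generated stand-in for `correl.dt`, and the column names `y.chg1` ... `y.chg6` are simply reused from the question.

```r
library(data.table)

set.seed(1)
# Toy stand-in for correl.dt: two X series of 30 rows each, sharing one Y series
toy.dt <- data.table(
  x_id    = rep(c("X1", "X2"), each = 30),
  x_value = cumsum(rnorm(60)),
  y_value = rep(cumsum(rnorm(30)), 2)
)

# Forward one-period change of X (next value minus current value), per series
toy.dt[, fwd.xchg := shift(x_value, 1, type = "lead") - x_value, by = x_id]

lags <- 1:6
# One grouped call: j returns a named list, which becomes one column per lag
res <- toy.dt[, {
  cors <- lapply(lags, function(k)
    cor(fwd.xchg, y_value - shift(y_value, k),
        method = "spearman", use = "pairwise"))
  setNames(cors, paste0("y.chg", lags))
}, by = x_id]
res
```

This keeps the `x_id` column next to the correlations, and changing `lags` is enough to test other horizons.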

Thanks. My problem is that I do everything step by step, so lapply, Map and other useful shortcuts are great. – Viitama