2017-03-05

Correlation matrix from a data.table

Suppose I have the following data.table:

library(data.table)

set.seed(1)
# 5 Ids x 20 month-end-ish dates = 100 rows; x1..x5 are columns 5 to 9
TDT <- data.table(Group = c(rep("A",40),rep("B",60)),
                  Id = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
                  Time = rep(seq(as.Date("2010-01-03"), length.out=20, by="1 month") - 1,5),
                  norm = round(runif(100)/10,2),
                  x1 = sample(100,100),
                  x2 = round(rnorm(100,0.75,0.3),2),
                  x3 = round(rnorm(100,0.75,0.3),2),
                  x4 = round(rnorm(100,0.75,0.3),2),
                  x5 = round(rnorm(100,0.75,0.3),2))

How can I compute the correlations between x1, x2, x3, x4 and x5 by Time?

This:

TDT[, x := list(cor(TDT[, 5:9])), by = Time]

does not work.

How can this be done in data.table?

+0

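Your data does not have multiple observations for each combination of Id and Time, which is necessary to compute correlations. Try 'TDT[Id == 1 & Time == "2010-01-02"]', or any other combination of Id and Time. Each one has only a single row. –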

+0

@Rose Hartman Sorry, I meant by Time only. – user3032689

Answers

1

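You were very close with your attempt! What you were missing is an extra list(). In addition, the correlation has to be computed on the columns of the current group: inside j, TDT[,5:9] still refers to the whole table, so it would give the same matrix for every Time.

This works:

TDT[, x := list(list(cor(cbind(x1, x2, x3, x4, x5)))), by = Time]

And TDT$x returns: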

[[1]] 
      x1   x2   x3   x4   x5 
x1 1.00000000 0.72185099 0.07368766 -0.7031890 -0.36895449 
x2 0.72185099 1.00000000 0.68058833 -0.7393130 0.05066973 
x3 0.07368766 0.68058833 1.00000000 -0.5021462 0.10645894 
x4 -0.70318896 -0.73931299 -0.50214616 1.0000000 0.11671020 
x5 -0.36895449 0.05066973 0.10645894 0.1167102 1.00000000 

[[2]] 
      x1   x2   x3   x4   x5 
x1 1.0000000 -0.1011948 -0.85191422 -0.15571603 0.4855237 
x2 -0.1011948 1.0000000 0.56691559 -0.44002621 -0.6699172 
x3 -0.8519142 0.5669156 1.00000000 0.02189754 -0.6168013 
x4 -0.1557160 -0.4400262 0.02189754 1.00000000 0.2236542 
x5 0.4855237 -0.6699172 -0.61680132 0.22365419 1.0000000 

[...] 

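The extra list() is needed because of how data.table parses the second argument (j) of the DT[i, j] syntax: the outer list() is read as the list of columns to assign, so the matrix needs a list() of its own to be stored as a single cell. This has been discussed in depth elsewhere on Stack Overflow, and I invite you to read this most excellent answer.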
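To make the role of the extra list() concrete, here is a minimal sketch on a made-up toy table. The names DT, g, u, v and m are hypothetical and not part of the question's data.

```r
library(data.table)

# Hypothetical toy table: two groups of three rows each.
DT <- data.table(g = rep(c("a", "b"), each = 3),
                 u = c(1, 2, 3, 1, 2, 4),
                 v = c(2, 4, 7, 3, 1, 0))

# Outer list(): the columns to assign (here just one, m).
# Inner list(): wraps the 2x2 matrix so it is stored as a single cell,
# which is then recycled over the rows of the group.
DT[, m := list(list(cor(cbind(u, v)))), by = g]

class(DT$m)      # the new column is a list column
dim(DT$m[[1]])   # each cell holds a full correlation matrix
```

With only one list(), the matrix itself would be taken as the column's values, and its length would not match the number of rows in the group.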

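As a side note, it seems preferable to replace the outermost call to list() with .() to make the intent clear. I also prefer to name the columns explicitly through .SD and .SDcols. You can rewrite your code, with the same result, as: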

TDT[, x := .(list(cor(.SD))), by = Time, .SDcols = 5:9] 
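If repeating the same matrix on every row of a group is not wanted, one possible variant is to aggregate instead of assigning with :=, which yields one row per Time. This is only a sketch: the names per_time, cm and xcols are my own, and the table is rebuilt from the question's code so that the block runs on its own.

```r
library(data.table)

# Rebuild the question's table (same code, same seed).
set.seed(1)
TDT <- data.table(Group = c(rep("A", 40), rep("B", 60)),
                  Id = rep(c(1, 2, 3, 4, 5), each = 20),
                  Time = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
                  norm = round(runif(100) / 10, 2),
                  x1 = sample(100, 100),
                  x2 = round(rnorm(100, 0.75, 0.3), 2),
                  x3 = round(rnorm(100, 0.75, 0.3), 2),
                  x4 = round(rnorm(100, 0.75, 0.3), 2),
                  x5 = round(rnorm(100, 0.75, 0.3), 2))

xcols <- paste0("x", 1:5)   # columns to correlate, selected by name

# No := here: the result is a new table with one row per Time,
# and cm is a list column holding one 5x5 matrix per row.
per_time <- TDT[, .(cm = list(cor(.SD))), by = Time, .SDcols = xcols]

# Pull out the matrix for a single date.
per_time[Time == as.Date("2010-01-02"), cm[[1]]]
```

Selecting the columns by name rather than by position (5:9) keeps the call valid even if columns are added or reordered later.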
1

You may find the corrr package useful for this. Combined with dplyr verbs, it makes it easy to get correlation matrices for different groups.

library(data.table) # not necessary unless you want the data in this format for other reasons 
library(dplyr) 
library(corrr) 

Get a correlation matrix for each Id:

> TDT %>% 
+ group_by(Id) %>% 
+ do({ 
+  correlate(select(., x1:x5)) 
+  }) 
Source: local data frame [25 x 7] 
Groups: Id [5] 

     Id rowname   x1   x2   x3   x4   x5 
    <dbl> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
1  1  x1   NA -0.246252411 -0.24589380 -0.181120555 0.14781414 
2  1  x2 -0.24625241   NA 0.32098291 -0.175603686 -0.08863810 
3  1  x3 -0.24589380 0.320982911   NA 0.161336670 0.07934436 
4  1  x4 -0.18112056 -0.175603686 0.16133667   NA -0.19662700 
5  1  x5 0.14781414 -0.088638098 0.07934436 -0.196627000   NA 
6  2  x1   NA 0.075760735 0.41276725 0.425032505 0.37178993 
7  2  x2 0.07576074   NA 0.07747543 -0.004202306 -0.08086958 
8  2  x3 0.41276725 0.077475426   NA 0.248151847 0.07619264 
9  2  x4 0.42503251 -0.004202306 0.24815185   NA 0.37647798 
10  2  x5 0.37178993 -0.080869584 0.07619264 0.376477979   NA 
# ... with 15 more rows 

Get a correlation matrix for each Time:

> TDT %>% 
+ group_by(Time) %>% 
+ do({ 
+  correlate(select(., x1:x5)) 
+ }) 
Source: local data frame [100 x 7] 
Groups: Time [20] 

     Time rowname   x1   x2   x3   x4   x5 
     <date> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
1 2010-01-02  x1   NA -0.66584960 -0.58788152 0.92540707 0.37316217 
2 2010-01-02  x2 -0.66584960   NA -0.06102424 -0.69292534 0.19440850 
3 2010-01-02  x3 -0.58788152 -0.06102424   NA -0.54623949 -0.78714932 
4 2010-01-02  x4 0.92540707 -0.69292534 -0.54623949   NA 0.53697784 
5 2010-01-02  x5 0.37316217 0.19440850 -0.78714932 0.53697784   NA 
6 2010-02-02  x1   NA -0.10444724 -0.62424401 0.30109335 0.04834057 
7 2010-02-02  x2 -0.10444724   NA -0.12010431 0.08966978 -0.68762698 
8 2010-02-02  x3 -0.62424401 -0.12010431   NA -0.92782037 0.52099983 
9 2010-02-02  x4 0.30109335 0.08966978 -0.92782037   NA -0.58214861 
10 2010-02-02  x5 0.04834057 -0.68762698 0.52099983 -0.58214861   NA 
# ... with 90 more rows 
+0

Very nice, ty. Do you also know how to do this in 'data.table'? – user3032689

+0

No, I don't, sorry :) –

1

split by Time, then run cor on each subgroup:

lapply(split(TDT, TDT$Time), function(a) cor(a[,5:9])) 

#OR 

lapply(split(TDT[,5:9], TDT$Time), cor) 
+0

Thanks, that works too, but it doesn't use 'data.table' syntax. – user3032689

+0

@user3032689, 'TDT[, 5:9][, cor(.SD), by = TDT$Time]'? –

+1

Oh, that works, but it seems strange to me that you can split by 'Time' when it is no longer contained in 'TDT[, 5:9]'. – user3032689
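On that last point: as far as I understand it, by is not limited to column names. It also accepts expressions that evaluate to a vector with one value per row, so a vector taken from outside the subsetted table can serve as the grouping key. A small sketch of this, where grp, out and cm are names I made up and the table is rebuilt from the question's code:

```r
library(data.table)

# Rebuild the question's table (same code, same seed).
set.seed(1)
TDT <- data.table(Group = c(rep("A", 40), rep("B", 60)),
                  Id = rep(c(1, 2, 3, 4, 5), each = 20),
                  Time = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
                  norm = round(runif(100) / 10, 2),
                  x1 = sample(100, 100),
                  x2 = round(rnorm(100, 0.75, 0.3), 2),
                  x3 = round(rnorm(100, 0.75, 0.3), 2),
                  x4 = round(rnorm(100, 0.75, 0.3), 2),
                  x5 = round(rnorm(100, 0.75, 0.3), 2))

grp <- TDT$Time   # plain vector with one entry per row of TDT

# TDT[, 5:9] no longer has a Time column, but it still has the same
# rows in the same order, so grp lines up with it row by row.
out <- TDT[, 5:9][, .(cm = list(cor(.SD))), by = .(Time = grp)]

nrow(out)         # one row per distinct value of grp
```

This only works because the column subset keeps the row order; after filtering or reordering rows, the external vector would no longer match.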