Correlation matrix from a data.table

Suppose I have the following data.table:

library(data.table)

set.seed(1)
TDT <- data.table(Group = c(rep("A", 40), rep("B", 60)),
                  Id    = c(rep(1, 20), rep(2, 20), rep(3, 20), rep(4, 20), rep(5, 20)),
                  Time  = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
                  norm  = round(runif(100) / 10, 2),
                  x1    = sample(100, 100),
                  x2    = round(rnorm(100, 0.75, 0.3), 2),
                  x3    = round(rnorm(100, 0.75, 0.3), 2),
                  x4    = round(rnorm(100, 0.75, 0.3), 2),
                  x5    = round(rnorm(100, 0.75, 0.3), 2))

How can I calculate the correlations between x1, x2, x3, x4 and x5 for each Time?

This:

TDT[,x:= list(cor(TDT[,5:9])), by = Time]

does not work.

How can this be done in data.table?


2017-03-05 user3032689

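Your data does not have multiple observations for each combination of Id and Time, which would be needed to compute a correlation. Try 'TDT[Id == 1 & Time == "2010-01-02"]', or any other combination of Id and Time. Each one has only a single row. –

@Rose Hartman Sorry, I meant by Time only – user3032689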
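
The point made in the comments above can be checked by counting rows per group. This is a minimal sketch that rebuilds only the Id and Time columns of the question's table (the name `keys` is introduced here for illustration):

```r
library(data.table)

# Only the grouping columns from the question are needed for this check.
keys <- data.table(
  Id   = rep(1:5, each = 20),
  Time = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5)
)

# Rows per (Id, Time) combination: a correlation needs more than one row.
n_id_time <- keys[, .N, by = .(Id, Time)]

# Rows per Time only: one row per Id, which is what cor() then works with.
n_time <- keys[, .N, by = Time]

print(range(n_id_time$N))
print(range(n_time$N))
```

Grouping by Time alone gives each group one row per Id, so a correlation matrix per Time is computable, although it rests on very few observations.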

You were so close with your attempt! What you were missing is an extra list().

This works:

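# cbind() of the columns keeps cor() on the current group's rows;
# TDT[,5:9] inside j would refer to the whole table, not the current group
TDT[, x := list(list(cor(cbind(x1, x2, x3, x4, x5)))), by = Time]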

And TDT$x returns:

[[1]]
            x1          x2          x3         x4          x5
x1  1.00000000  0.72185099  0.07368766 -0.7031890 -0.36895449
x2  0.72185099  1.00000000  0.68058833 -0.7393130  0.05066973
x3  0.07368766  0.68058833  1.00000000 -0.5021462  0.10645894
x4 -0.70318896 -0.73931299 -0.50214616  1.0000000  0.11671020
x5 -0.36895449  0.05066973  0.10645894  0.1167102  1.00000000

[[2]]
           x1         x2          x3          x4         x5
x1  1.0000000 -0.1011948 -0.85191422 -0.15571603  0.4855237
x2 -0.1011948  1.0000000  0.56691559 -0.44002621 -0.6699172
x3 -0.8519142  0.5669156  1.00000000  0.02189754 -0.6168013
x4 -0.1557160 -0.4400262  0.02189754  1.00000000  0.2236542
x5  0.4855237 -0.6699172 -0.61680132  0.22365419  1.0000000

[...]

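The extra list() is needed because of how data.table parses the second element of the DT[1,2] syntax. This has been discussed in depth elsewhere on Stack Overflow, and I invite you to read this most excellent answer.

As a side note, it reads better to replace the outermost list() call with .() to make the intent clear. I would also refer to the columns explicitly through .SD and .SDcols, which guarantees that cor() only sees the rows of the current Time group. To get the result shown above, you can rewrite your code as: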

TDT[, x := .(list(cor(.SD))), by = Time, .SDcols = 5:9]
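
With `:=`, the matrix is stored once per row, so each Time's matrix is repeated for every row of that group. If one matrix per Time is enough, a possible variant is to aggregate instead of assigning. This is only a sketch: the names `xcols`, `cor_by_time` and `cormat` are introduced here for illustration and are not part of the answer above.

```r
library(data.table)

# Toy data rebuilt as in the question.
set.seed(1)
TDT <- data.table(Group = c(rep("A", 40), rep("B", 60)),
                  Id    = rep(1:5, each = 20),
                  Time  = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
                  norm  = round(runif(100) / 10, 2),
                  x1    = sample(100, 100),
                  x2    = round(rnorm(100, 0.75, 0.3), 2),
                  x3    = round(rnorm(100, 0.75, 0.3), 2),
                  x4    = round(rnorm(100, 0.75, 0.3), 2),
                  x5    = round(rnorm(100, 0.75, 0.3), 2))

# Columns to correlate, given by name rather than by position.
xcols <- paste0("x", 1:5)

# One row per Time; the list column `cormat` holds that group's 5 x 5 matrix.
cor_by_time <- TDT[, .(cormat = list(cor(.SD))), by = Time, .SDcols = xcols]

# Pull out the matrix for a single date.
m <- cor_by_time[Time == as.Date("2010-01-02"), cormat[[1]]]
print(dim(m))
```

The result has one row per distinct Time instead of one per observation, which is usually easier to iterate over or to join back onto the original table.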


2017-03-05 22:52:13 Jealie

You may find the corrr package useful for this. Combined with dplyr verbs, it lets you easily obtain correlation matrices for different groups.

library(data.table) # not necessary unless you want the data in this format for other reasons 
library(dplyr) 
library(corrr)

Get a correlation matrix for each Id:

> TDT %>% 
+ group_by(Id) %>% 
+ do({ 
+  correlate(select(., x1:x5)) 
+  }) 
Source: local data frame [25 x 7] 
Groups: Id [5] 

      Id rowname          x1           x2          x3           x4          x5
   <dbl>   <chr>       <dbl>        <dbl>       <dbl>        <dbl>       <dbl>
1      1      x1          NA -0.246252411 -0.24589380 -0.181120555  0.14781414
2      1      x2 -0.24625241           NA  0.32098291 -0.175603686 -0.08863810
3      1      x3 -0.24589380  0.320982911          NA  0.161336670  0.07934436
4      1      x4 -0.18112056 -0.175603686  0.16133667           NA -0.19662700
5      1      x5  0.14781414 -0.088638098  0.07934436 -0.196627000          NA
6      2      x1          NA  0.075760735  0.41276725  0.425032505  0.37178993
7      2      x2  0.07576074           NA  0.07747543 -0.004202306 -0.08086958
8      2      x3  0.41276725  0.077475426          NA  0.248151847  0.07619264
9      2      x4  0.42503251 -0.004202306  0.24815185           NA  0.37647798
10     2      x5  0.37178993 -0.080869584  0.07619264  0.376477979          NA
# ... with 15 more rows

Get a correlation matrix for each Time:

> TDT %>% 
+ group_by(Time) %>% 
+ do({ 
+  correlate(select(., x1:x5)) 
+ }) 
Source: local data frame [100 x 7] 
Groups: Time [20] 

         Time rowname          x1          x2          x3          x4          x5
       <date>   <chr>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
1  2010-01-02      x1          NA -0.66584960 -0.58788152  0.92540707  0.37316217
2  2010-01-02      x2 -0.66584960          NA -0.06102424 -0.69292534  0.19440850
3  2010-01-02      x3 -0.58788152 -0.06102424          NA -0.54623949 -0.78714932
4  2010-01-02      x4  0.92540707 -0.69292534 -0.54623949          NA  0.53697784
5  2010-01-02      x5  0.37316217  0.19440850 -0.78714932  0.53697784          NA
6  2010-02-02      x1          NA -0.10444724 -0.62424401  0.30109335  0.04834057
7  2010-02-02      x2 -0.10444724          NA -0.12010431  0.08966978 -0.68762698
8  2010-02-02      x3 -0.62424401 -0.12010431          NA -0.92782037  0.52099983
9  2010-02-02      x4  0.30109335  0.08966978 -0.92782037          NA -0.58214861
10 2010-02-02      x5  0.04834057 -0.68762698  0.52099983 -0.58214861          NA
# ... with 90 more rows


2017-03-05 19:51:45

Very nice, ty. Do you also know how to do this in 'data.table'? – user3032689

No, I don't, sorry :) –

split by Time, then run cor on each subgroup:

lapply(split(TDT, TDT$Time), function(a) cor(a[,5:9])) 

#OR 

lapply(split(TDT[,5:9], TDT$Time), cor)
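
Both calls return a list with one correlation matrix per Time, named by the date it belongs to. A short usage sketch, assuming the table from the question (the name `res` is introduced here for illustration):

```r
library(data.table)

# Toy data rebuilt as in the question.
set.seed(1)
TDT <- data.table(Group = c(rep("A", 40), rep("B", 60)),
                  Id    = rep(1:5, each = 20),
                  Time  = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
                  norm  = round(runif(100) / 10, 2),
                  x1    = sample(100, 100),
                  x2    = round(rnorm(100, 0.75, 0.3), 2),
                  x3    = round(rnorm(100, 0.75, 0.3), 2),
                  x4    = round(rnorm(100, 0.75, 0.3), 2),
                  x5    = round(rnorm(100, 0.75, 0.3), 2))

# One correlation matrix per Time, as in the second call above.
res <- lapply(split(TDT[, 5:9], TDT$Time), cor)

# The list is named by the dates converted to character.
print(head(names(res), 3))

# Look up the matrix for one date by its name.
print(res[["2010-01-02"]])
```

Because the result is a plain named list, it can be indexed by date string or passed on to further lapply() or sapply() calls.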


2017-03-05 20:58:29

Thanks, that works too, but it doesn't use 'data.table' syntax. – user3032689

@user3032689, 'TDT[, 5:9][, cor(.SD), by = TDT$Time]'? –

Oh, that works, but it seems odd to me that you can group by 'Time' when it is no longer contained in 'TDT[, 5:9]'. – user3032689
