2017-04-25 43 views
0

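Loop in R data.table: create groups by variable distribution, then calculate means by new groups

I have a set of variables (about 21) that I want to loop over, doing the following for each one:
1. Split the observations into 10 groups within each year, determined by the within-year deciles of the variable.
2. Calculate means (equal-weighted and value-weighted) by these new groups.

Test data: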

library(data.table)
set.seed(4)
YR = data.table(yr=1962:2015) 
ID = data.table(id=10001:11000) 
DT <- YR[,as.list(ID), by = yr] # intentional cartesian join 
rm("YR","ID") 
# 54,000 obs now add data 
DT[,`:=` (ratio = rep(sample(10),each=5400)+rnorm(nrow(DT)), 
      ratio2 = rep(sample(5),each=10800)+rnorm(nrow(DT)), 
      weight = abs(rnorm(nrow(DT)))*100, 
      val = rnorm(nrow(DT)) 
      )] 
DT 
     yr id  ratio ratio2 weight  val 
    1: 1962 10001 6.689275 4.895357 129.10487 -0.2022073 
    2: 1962 10002 4.718753 4.505419 140.70420 -0.0887587 
    3: 1962 10003 5.786855 4.359488 242.10988 0.9511465 
    4: 1962 10004 7.896540 4.049974 89.23235 -1.3822148 
    5: 1962 10005 7.776863 2.233036 177.79650 -1.0671091 
    ---             
53996: 2015 10996 10.613272 3.345091 153.81424 0.9269429 
53997: 2015 10997 11.260932 1.804315 15.68129 -1.6618414 
53998: 2015 10998 8.591909 3.332643 134.80929 -1.1632596 
53999: 2015 10999 9.143039 3.012160 178.77301 -0.4761060 
54000: 2015 11000 7.470945 4.068919 121.13470 -1.7594423 

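So I want to loop over ratio, then ratio2, and so on, computing deciles for each, and then summarize val by each of these newly computed deciles. Note that these are not numbered variables, so I can't recreate the names with paste() and a 1:21 vector. First, I wrote this function to do the grouping: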

# [function] pctl.grp - order data into groups based on percentile breakpoints
# grp: number of groups
pctl.grp <- function(dat, grp) { 
    bp <- quantile(dat, probs = c(0,seq(100/grp,100,100/grp))/100) 
    cut(dat,bp,labels = FALSE, include.lowest = TRUE) 
} 
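
As a quick check of what this helper returns, here is a minimal sketch on a made-up vector (the vector `x` and the seed are assumptions for illustration, not part of the question's data). With continuous, untied values the quantile breakpoints split the data into groups of equal size.

```r
# Same definition as above, repeated so the snippet runs on its own
pctl.grp <- function(dat, grp) {
  bp <- quantile(dat, probs = c(0, seq(100/grp, 100, 100/grp))/100)
  cut(dat, bp, labels = FALSE, include.lowest = TRUE)
}

set.seed(1)
x <- rnorm(1000)        # hypothetical toy data, no ties expected
g <- pctl.grp(x, 10)    # integer group codes 1..10

table(g)                # each of the 10 groups holds 100 values
```

Heavily tied data can produce duplicated breakpoints, in which case `cut()` stops with an error, so this sketch only covers the continuous case.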

Then I can do a single iteration like this:

# adds in new variable containing 10 groups numbered 1-10 
DT[,ratiogrp := lapply(.SD, pctl.grp, 10), by = .(yr), .SDcols = c("ratio")] 

DT[,.(ewval = mean(val), 
     ewratio = mean(ratio), 
     vwval = weighted.mean(val, weight, na.rm = TRUE), 
     vwratio = weighted.mean(ratio, weight, na.rm = TRUE)) ,by=ratiogrp][order(ratiogrp)] 

This gives the desired result:

ratiogrp  ewval ewratio  vwval vwratio 
1:  1 -0.027994385 3.576939 -0.039512050 3.572319 
2:  2 -0.001146009 4.329835 0.005093692 4.331433 
3:  3 -0.009087386 4.784103 -0.012764902 4.767494 
4:  4 -0.014961467 5.094431 -0.015464918 5.110614 
5:  5 0.014705294 5.373705 0.015276699 5.364962 
6:  6 -0.010195630 5.645182 -0.014102394 5.618484 
7:  7 0.001297953 5.949583 -0.012839401 5.925634 
8:  8 -0.009300910 6.265297 -0.007141404 6.263371 
9:  9 0.012970539 6.651047 0.018474949 6.684825 
10:  10 0.003841495 7.363449 -0.004225650 7.351828 

But how do I loop through this 21 times, once per variable? I can easily get the names of my variables like this:

> grep(c("ratio"), names(DT)) 
[1] 3 4 
> names(DT)[grep(c("ratio"), names(DT))] 
[1] "ratio" "ratio2" 

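So I figure for (z in 1:length(namelist)) {} or something similar would work. But I'm not sure how to reference these names (or numbers) inside the data.table structure to recreate what I did above.

Answer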

2

Go to long format...

mDT = melt(DT, meas=patterns("ratio"), value.name = "ratio") 
setorder(mDT, variable, yr, ratio) 
mDT[, dec := cut(.I, 10, labels = FALSE), by=.(yr, variable)] 

mDT[, .(
    mval = mean(val), 
    mrat = mean(ratio), 
    wmval = weighted.mean(val, weight), 
    wmrat = weighted.mean(ratio, weight) 
), keyby=.(variable, dec)] 

    variable dec   mval  mrat  wmval wmrat 
1: ratio 1 -0.0279943849 3.576939 -0.039512050 3.572319 
2: ratio 2 -0.0011460087 4.329835 0.005093692 4.331433 
3: ratio 3 -0.0090873863 4.784103 -0.012764902 4.767494 
4: ratio 4 -0.0149614666 5.094431 -0.015464918 5.110614 
5: ratio 5 0.0147052939 5.373705 0.015276699 5.364962 
6: ratio 6 -0.0101956297 5.645182 -0.014102394 5.618484 
7: ratio 7 0.0012979528 5.949583 -0.012839401 5.925634 
8: ratio 8 -0.0093009096 6.265297 -0.007141404 6.263371 
9: ratio 9 0.0129705386 6.651047 0.018474949 6.684825 
10: ratio 10 0.0038414948 7.363449 -0.004225650 7.351828 
11: ratio2 1 -0.0120823787 1.195964 -0.016154551 1.199026 
12: ratio2 2 -0.0072534833 1.904354 -0.030409684 1.908494 
13: ratio2 3 -0.0283728080 2.282277 -0.028168936 2.301685 
14: ratio2 4 -0.0068901529 2.590815 0.002836866 2.585152 
15: ratio2 5 -0.0035769658 2.880104 0.002391468 2.872702 
16: ratio2 6 0.0087575593 3.147469 0.004565452 3.134459 
17: ratio2 7 -0.0052354409 3.412187 -0.005866282 3.426711 
18: ratio2 8 0..704371 0.009488475 3.701694 
19: ratio2 9 0.0027419978 4.071582 -0.008958386 4.076264 
20: ratio2 10 -0.0002925368 4.786477 0.003691116 4.772209 
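
For comparison, the loop the question has in mind can also be written against the wide table. The following is only a sketch under stated assumptions: it uses a small made-up data set rather than the question's DT, a temporary column named `grp` (hypothetical), and `get()` to look up each column by name. It is not the approach used in this answer.

```r
library(data.table)

# Grouping helper from the question, repeated so the snippet is self-contained
pctl.grp <- function(dat, grp) {
  bp <- quantile(dat, probs = c(0, seq(100/grp, 100, 100/grp))/100)
  cut(dat, bp, labels = FALSE, include.lowest = TRUE)
}

# Small hypothetical data set with the same column names as the question
set.seed(4)
DT <- CJ(yr = 2001:2003, id = 1:200)
DT[, `:=`(ratio  = rnorm(.N),
          ratio2 = rnorm(.N),
          weight = abs(rnorm(.N)) * 100,
          val    = rnorm(.N))]

# Column names to loop over, selected by pattern as in the question
ratio_vars <- grep("ratio", names(DT), value = TRUE)

res <- rbindlist(
  sapply(ratio_vars, function(v) {
    # Within-year decile of the current column, stored in a scratch column
    DT[, grp := pctl.grp(get(v), 10), by = yr]
    # Equal-weighted and value-weighted means by decile
    DT[, .(ewval   = mean(val),
           ewratio = mean(get(v)),
           vwval   = weighted.mean(val, weight),
           vwratio = weighted.mean(get(v), weight)),
       keyby = grp]
  }, simplify = FALSE),
  idcol = "variable"   # list names become the 'variable' column
)
DT[, grp := NULL]      # drop the scratch column

res
```

The long-format version avoids the scratch column and the repeated `get()` lookups, which is one reason to prefer it when all variables receive the same treatment.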
+0

Brilliant. However, I'm not sure why cut(.I, 10, labels = FALSE) gives the same result as using my function. I swapped my pctl.grp() function in for cut() and got what I wanted. –

+1

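@JesseBlocher Cool, glad it works. `.I` is just a vector of consecutive row numbers (since the data is sorted by yr and variable), some vector `n:m`; `cut(n:m, 10, labels = FALSE)` cuts the rows into 10 equal-sized groups; and since the data is also sorted by ratio, these groups are its deciles. The `labels = FALSE` part means it assigns integers to the deciles. It's kind of convoluted, I know.... – Frank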

+0

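I'm pretty sure `cut(x, 10)` doesn't put rows into equal-sized groups. If you do `x <- rnorm(100)`, then `c <- cut(x, 10, labels = FALSE)`, and draw a histogram of `c`, you should get bars of equal height, and you don't. Compare with my function. That's beside the point, though - I think my example data is random uniform, so the two look the same. –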
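
The two comments above describe different inputs to `cut()`: the values themselves versus the row positions of sorted data. A minimal sketch of the difference, using a made-up vector `x` (an assumption for illustration) with no ties:

```r
set.seed(1)
x <- rnorm(100)   # hypothetical toy data

# Equal-width bins on the values: bin counts follow the shape of the data
by_value <- cut(x, 10, labels = FALSE)
tabulate(by_value, nbins = 10)

# Equal-width bins on the row positions of the sorted data: 10 rows per bin
by_rank <- cut(seq_along(x), 10, labels = FALSE)   # i-th entry belongs to sort(x)[i]
tabulate(by_rank, nbins = 10)

# Quantile breakpoints on the values, as in pctl.grp()
bp <- quantile(x, probs = seq(0, 1, 0.1))
by_quantile <- cut(x, bp, labels = FALSE, include.lowest = TRUE)

# Reordered to sorted order, the quantile groups match the position-based groups
identical(by_rank, by_quantile[order(x)])
```

So cutting `.I` only yields deciles because the rows were first ordered by the variable within each group; with ties or a different quantile definition the two approaches may assign boundary rows differently.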