2011-02-17 71 views
17

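Adding aggregated values back to the original data frame

One of the design patterns I use over and over is performing a "group by" or "split, apply, combine (SAC)" on a data frame and then joining the aggregated data back to the original data. This is useful, for example, when calculating each county's deviation from the state mean in a data frame with many states and counties. My aggregate calculation is rarely just a simple mean, but it makes a good example. I often solve this problem the following way: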

require(plyr) 
set.seed(1) 

## set up some data 
group1 <- rep(1:3, 4) 
group2 <- sample(c("A","B","C"), 12, replace=TRUE) 
values <- rnorm(12) 
df <- data.frame(group1, group2, values) 

## got some data, so let's aggregate 

group1Mean <- ddply(df, "group1", function(x) 
        data.frame(meanValue = mean(x$values))) 
df <- merge(df, group1Mean) 
df 

which produces nicely aggregated data like the following:

> df 
    group1 group2 values meanValue 
1  1  A 0.48743 -0.121033 
2  1  A -0.04493 -0.121033 
3  1  C -0.62124 -0.121033 
4  1  C -0.30539 -0.121033 
5  2  A 1.51178 0.004804 
6  2  B 0.73832 0.004804 
7  2  A -0.01619 0.004804 
8  2  B -2.21470 0.004804 
9  3  B 1.12493 0.758598 
10  3  C 0.38984 0.758598 
11  3  B 0.57578 0.758598 
12  3  A 0.94384 0.758598 

This works, but are there alternative ways of doing it that improve readability, performance, etc.?


See http://stackoverflow.com/questions/4998846/applying-an-aggregate-function-over-multiple-different-slices/5000040#5000040 – 2011-02-17 15:46:40

Answers

18

One line of code does the trick:

new <- ddply(df, "group1", transform, numcolwise(mean)) 
new 

group1 group2  values meanValue 
1  1  A 0.48742905 -0.121033381 
2  1  A -0.04493361 -0.121033381 
3  1  C -0.62124058 -0.121033381 
4  1  C -0.30538839 -0.121033381 
5  2  A 1.51178117 0.004803931 
6  2  B 0.73832471 0.004803931 
7  2  A -0.01619026 0.004803931 
8  2  B -2.21469989 0.004803931 
9  3  B 1.12493092 0.758597929 
10  3  C 0.38984324 0.758597929 
11  3  B 0.57578135 0.758597929 
12  3  A 0.94383621 0.758597929 

identical(df, new) 
[1] TRUE 
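
Note that df in the check above already carries the meanValue column from the merge step in the question. A minimal sketch of a variant that names the new column explicitly, assuming a data frame holding only the three original columns (rebuilt here as `raw`, a name introduced for illustration):

```r
library(plyr)

set.seed(1)
# Rebuild the question's data without the merged meanValue column
raw <- data.frame(
  group1 = rep(1:3, 4),
  group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
  values = rnorm(12)
)

# transform() is applied to each group1 slice; the scalar mean is
# recycled across the rows of that slice
new <- ddply(raw, "group1", transform, meanValue = mean(values))
head(new)
```

Naming the column in the transform call also makes it straightforward to add further per-group columns in the same step.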

I forgot all about `transform`. Hindsight is 20/20. But thanks for illustrating `numcolwise`, which I was not familiar with. – 2011-02-17 15:57:22


This is a nice idiom, but it gets tricky when some variables should be sums and others means. – richiemorrisroe 2012-08-14 17:49:17


@richiemorrisroe Any trickier than with any other idiom? – Andrie 2012-08-14 20:05:09

9

Can't you just include x in the data frame returned by the function you pass to ddply?

df <- ddply(df, "group1", function(x) 
      data.frame(x, meanValue = mean(x$values))) 
13

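I think ave() is more useful here than the plyr call you show (I'm not familiar enough with plyr to know whether you can do what you want with plyr directly, though I would be surprised if you can't) or the other base R alternatives (aggregate(), tapply()):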

> with(df, ave(values, group1, FUN = mean)) 
[1] -0.121033381 0.004803931 0.758597929 -0.121033381 0.004803931 
[6] 0.758597929 -0.121033381 0.004803931 0.758597929 -0.121033381 
[11] 0.004803931 0.758597929 

You can use within() or transform() to embed this result directly into df:

> df2 <- within(df, meanValue <- ave(values, group1, FUN = mean)) 
> head(df2) 
    group1 group2  values meanValue 
1  1  A 0.4874291 -0.121033381 
2  2  B 0.7383247 0.004803931 
3  3  B 0.5757814 0.758597929 
4  1  C -0.3053884 -0.121033381 
5  2  A 1.5117812 0.004803931 
6  3  C 0.3898432 0.758597929 
> df3 <- transform(df, meanValue = ave(values, group1, FUN = mean)) 
> all.equal(df2,df3) 
[1] TRUE 

And if the ordering is important:

> head(df2[order(df2$group1, df2$group2), ]) 
    group1 group2  values meanValue 
1  1  A 0.48742905 -0.121033381 
10  1  A -0.04493361 -0.121033381 
4  1  C -0.30538839 -0.121033381 
7  1  C -0.62124058 -0.121033381 
5  2  A 1.51178117 0.004803931 
11  2  A -0.01619026 0.004803931 
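
The county-versus-state deviation mentioned in the question could be sketched the same way. The `counties` data frame, its column names, and its values below are made up for illustration:

```r
# Hypothetical data: two states with three counties each
counties <- data.frame(
  state  = c("NY", "NY", "NY", "TX", "TX", "TX"),
  county = c("a", "b", "c", "d", "e", "f"),
  value  = c(10, 20, 30, 1, 2, 6)
)

# ave() returns the group mean aligned with each row, so the deviation
# is a plain vector subtraction and the row order is preserved
counties <- transform(
  counties,
  stateMean = ave(value, state, FUN = mean),
  deviation = value - ave(value, state, FUN = mean)
)
counties
```

The group mean is computed twice because transform() evaluates its arguments against the original columns, so `deviation` cannot refer to `stateMean` within the same call.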
13

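In terms of performance, you can do the same kind of operation with the data.table package, which has built-in aggregation and is very fast thanks to indices and a C-based implementation. For instance, given that df already exists from your example: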

 
library("data.table") 
dt<-as.data.table(df) 
setkey(dt,group1) 
dt<-dt[,list(group2,values,meanValue=mean(values)),by=group1] 
dt 
     group1 group2  values meanValue 
[1,]  1  A 0.82122120 0.18810771 
[2,]  1  C 0.78213630 0.18810771 
[3,]  1  C 0.61982575 0.18810771 
[4,]  1  A -1.47075238 0.18810771 
[5,]  2  B 0.59390132 0.03354688 
[6,]  2  A 0.07456498 0.03354688 
[7,]  2  B -0.05612874 0.03354688 
[8,]  2  A -0.47815006 0.03354688 
[9,]  3  B 0.91897737 -0.20205707 
[10,]  3  C -1.98935170 -0.20205707 
[11,]  3  B -0.15579551 -0.20205707 
[12,]  3  A 0.41794156 -0.20205707

I have not benchmarked it, but in my experience it is a lot faster.

If you decide to go down the data.table road, which I think is worth exploring if you work with large data sets, you really need to read the docs, because there are some differences from data.frame that can bite you if you are unaware of them. Notably, however, a data.table generally works with any function expecting a data frame, since a data.table reports its type as data frame (data.table inherits from data.frame).
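
A small sketch of both points, using a toy data frame made up for illustration: the class inheritance, and one behavioral difference of the kind the docs warn about.

```r
library(data.table)

df <- data.frame(group1 = rep(1:3, 2), values = c(1, 2, 3, 4, 5, 6))
dt <- as.data.table(df)

# A data.table carries both classes, so data-frame code accepts it
class(dt)
is.data.frame(dt)

# One difference that can bite: a single index selects rows in a
# data.table but columns in a data.frame
dt[2]   # second row
df[2]   # second column
```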

[ Feb 2011 ]


[ Aug 2012 ] Update from Matthew :

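New in v1.8.2, released to CRAN in July 2012, is := by group. This is very similar to the answer above, but it adds the new column to dt by reference, so there is no copy and no need for a merge step or for relisting the existing columns to return them alongside the aggregate. There is no need to setkey first, and it copes with non-contiguous groups (i.e. groups whose rows are not stored together).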

This is significantly faster for large datasets, and has a simple and short syntax:

dt <- as.data.table(df) 
dt[, meanValue := mean(values), by = group1] 
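
As a sketch of how this extends to the case raised in the comments further up, where one column should be a sum and another a mean, several columns can be assigned in a single grouped := call. The column name `sumValue` is introduced here for illustration:

```r
library(data.table)

set.seed(1)
dt <- data.table(
  group1 = rep(1:3, 4),
  values = rnorm(12)
)

# Functional form of := adds both columns by reference, per group
dt[, `:=`(meanValue = mean(values), sumValue = sum(values)), by = group1]

# Rows stay in their original order; no merge step is needed
dt
```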
1

A dplyr possibility:

library(dplyr) 
df %>% 
    group_by(group1) %>% 
    mutate(meanValue = mean(values)) 

This returns the data frame in its original order. If you want it ordered by "group1", add arrange(group1) to the pipe.
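
A slightly fuller sketch of the same pipeline, assuming the three original columns from the question. The `deviation` column is added for illustration, and ungroup() drops the grouping so that later steps operate on the whole table:

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  group1 = rep(1:3, 4),
  group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
  values = rnorm(12)
)

result <- df %>%
  group_by(group1) %>%
  mutate(meanValue = mean(values),
         deviation = values - meanValue) %>%  # uses the column just created
  ungroup() %>%
  arrange(group1)

result
```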