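Aggregating a data frame while keeping the original order, in a simple manner

I'm having some trouble aggregating a data frame while keeping the groups in their original order (ordered by first appearance in the data frame). I've managed to get it right, but I was hoping there is an easier way to go about it.

Here is a sample data set to work with: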

set.seed(7) 
sel.1 <- sample(1:5, 20, replace = TRUE)  # selection vector 1 
sel.2 <- sample(1:5, 20, replace = TRUE) 
add.1 <- sample(81:100)      # additional vector 1 
add.2 <- sample(81:100) 
orig.df <- data.frame(sel.1, sel.2, add.1, add.2)

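A few things to note: there are two selection columns that determine how the data are grouped. They will always be the same, and their names are known. I've only put two additional columns in this data, but there may be more. I've given the columns names starting with 'sel' and 'add' to make it easier to follow, but the actual data have different names (so while grep tricks are cool, they won't be useful here).

What I'm trying to do is aggregate the data frame into groups based on the 'sel' columns, and sum all the 'add' columns within each group. This is simple enough using aggregate as follows: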

# Get the names of all the additional columns 
all.add <- names(orig.df)[!(names(orig.df)) %in% c("sel.1", "sel.2")] 
aggr.df <- aggregate(orig.df[,all.add], 
        by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), sum)

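The problem is that the result is sorted by the 'sel' columns; I want it ordered by each group's first appearance in the original data.

Here are my best attempts at making this work: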

## Attempt 1 
# create indices for each row (x) and find the minimum index for each range 
index.df <- aggregate(x = 1:nrow(orig.df), 
         by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), min) 
# Make sure the x vector (indices) are in the right range for aggr.df 
index.order <- (1:nrow(index.df))[order(index.df$x)] 
aggr.df[index.order,] 

## Attempt 2 
# get the unique groups. These are in the right order. 
unique.sel <- unique(orig.df[,c("sel.1", "sel.2")]) 
# use sapply to effectively loop over data and sum additional columns. 
sums <- t(sapply(1:nrow(unique.sel), function (x) { 
    sapply(all.add, function (y) { 
     sum(aggr.df[which(aggr.df$sel.1 == unique.sel$sel.1[x] & 
          aggr.df$sel.2 == unique.sel$sel.2[x]), y]) 
     }) 
})) 
data.frame(unique.sel, sums)

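While these give me the right result, I was hoping someone could point out a simpler solution. It would be preferable if the solution worked with the packages that come with a standard R installation.

I've looked at the documentation for aggregate and match, but I couldn't find an answer (I guess I was hoping for something like a "keep.original.order" parameter for aggregate).

Any help would be much appreciated!

Update: (in case anyone stumbles across this)

Here is the cleanest way I could find after trying for a few more days: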

unique(data.frame(sapply(names(orig.df), function(x){ 
    if(x %in% c("sel.1", "sel.2")) orig.df[,x] else 
    ave(orig.df[,x], orig.df$sel.1, orig.df$sel.2, FUN=sum)}, 
simplify=FALSE)))
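The "keep.original.order" behaviour wished for above can be approximated with a small base-R wrapper. The following is only a minimal sketch: the function name aggregate_in_order and the toy data frame are hypothetical, made up for illustration. It runs aggregate() as usual and then reorders the result by matching group labels against their order of first appearance.

```r
# Hypothetical helper (not part of base R): aggregate(), then reorder the
# result so that groups appear in order of first occurrence in df.
aggregate_in_order <- function(df, keys, FUN = sum) {
  value.cols <- setdiff(names(df), keys)
  aggr <- aggregate(df[value.cols], by = df[keys], FUN = FUN)
  # One label per group, in the order the groups first appear in df
  first.seen <- unique(do.call(paste, c(df[keys], sep = "\t")))
  # The same labels, in the (sorted) order aggregate() returned them
  aggr.lab <- do.call(paste, c(aggr[keys], sep = "\t"))
  out <- aggr[match(first.seen, aggr.lab), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy data (illustrative only; not the data set from the question)
toy <- data.frame(sel.1 = c(2, 1, 2, 1, 3),
                  sel.2 = c(1, 1, 1, 2, 1),
                  add.1 = c(10, 20, 30, 40, 50),
                  add.2 = c(1, 2, 3, 4, 5))
res <- aggregate_in_order(toy, c("sel.1", "sel.2"))
print(res)
```

Because the grouping columns are passed by name, this does not depend on any naming pattern in the real data.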

2012-08-08 Edward

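Thanks for the update, this is probably the best solution short of using data.table. How can we get the R development team to implement a 'keep.original.order' argument for aggregate? It seems like an obvious oversight. – 2013-09-13 09:08:16

It's a bit tough to read, but it gives you what you want, and I've added some comments to clarify.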

# Define the columns you want to combine into the grouping variable 
sel.col <- grepl("^sel", names(orig.df)) 
# Create the grouping variable 
lev <- apply(orig.df[sel.col], 1, paste, collapse=" ") 
# Split and sum up 
data.frame(unique(orig.df[sel.col]), 
      t(sapply(split(orig.df[!sel.col], factor(lev, levels=unique(lev))), 
        apply, 2, sum)))

The output looks like this:

   sel.1 sel.2 add.1 add.2
1      5     4    96    84
2      2     2   175   176
3      1     5   384   366
5      2     5    95    89
6      4     1   174   192
7      2     4    82    87
8      5     3    91    98
10     3     2   189   178
11     1     4   170   183
14     1     1   100    91
17     3     3    81    82
19     5     5    83    88
20     2     3    90    96
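Since the question notes that the real column names do not follow a pattern, here is a hedged variant that takes the grouping columns by name. It swaps in a different technique from the split/sapply approach above: base R's rowsum() with reorder = FALSE, which returns the groups in the order they are encountered. The toy data frame is made up for illustration.

```r
# Toy data (illustrative only; not the data set from the question)
toy <- data.frame(sel.1 = c(2, 1, 2, 1, 3),
                  sel.2 = c(1, 1, 1, 2, 1),
                  add.1 = c(10, 20, 30, 40, 50),
                  add.2 = c(1, 2, 3, 4, 5))

keys <- c("sel.1", "sel.2")               # grouping columns, given by name
value.cols <- setdiff(names(toy), keys)   # everything else gets summed

# One label per row that identifies its group
grp <- do.call(paste, toy[keys])

# reorder = FALSE keeps groups in order of first appearance
sums <- rowsum(toy[value.cols], group = grp, reorder = FALSE)

# unique() also returns the key combinations in first-appearance order,
# so the two pieces line up row by row
res <- data.frame(unique(toy[keys]), sums)
rownames(res) <- NULL
print(res)
```

Note that rowsum() only computes sums, so this sketch does not generalise to other summary functions the way aggregate() does.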

2012-08-08 19:29:09 Backlin

This is short and simple in data.table. It returns the groups in first-appearance order by default.

require(data.table) 
DT = as.data.table(orig.df) 
DT[, list(sum(add.1),sum(add.2)), by=list(sel.1,sel.2)] 

    sel.1 sel.2  V1  V2
 1:     5     4  96  84
 2:     2     2 175 176
 3:     1     5 384 366
 4:     2     5  95  89
 5:     4     1 174 192
 6:     2     4  82  87
 7:     5     3  91  98
 8:     3     2 189 178
 9:     1     4 170 183
10:     1     1 100  91
11:     3     3  81  82
12:     5     5  83  88
13:     2     3  90  96

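This will be fast for large data, so there is no need to change your code later if you do run into speed issues. The following alternative syntax is the easiest way to pass in which columns to group by.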

DT[, lapply(.SD,sum), by=c("sel.1","sel.2")] 

    sel.1 sel.2 add.1 add.2
 1:     5     4    96    84
 2:     2     2   175   176
 3:     1     5   384   366
 4:     2     5    95    89
 5:     4     1   174   192
 6:     2     4    82    87
 7:     5     3    91    98
 8:     3     2   189   178
 9:     1     4   170   183
10:     1     1   100    91
11:     3     3    81    82
12:     5     5    83    88
13:     2     3    90    96

Or, by can also be a single comma-separated string of column names:

DT[, lapply(.SD,sum), by="sel.1,sel.2"]
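If only some of the non-grouping columns should be summed, data.table's .SDcols argument restricts what .SD contains. The following is a minimal sketch assuming the data.table package is installed; the toy table and the variable names group.cols and value.cols are made up for illustration.

```r
library(data.table)

# Toy data (illustrative only; not the data set from the question)
toy <- data.table(sel.1 = c(2, 1, 2, 1, 3),
                  sel.2 = c(1, 1, 1, 2, 1),
                  add.1 = c(10, 20, 30, 40, 50),
                  add.2 = c(1, 2, 3, 4, 5),
                  note  = c("a", "b", "c", "d", "e"))

group.cols <- c("sel.1", "sel.2")   # columns to group by
value.cols <- c("add.1", "add.2")   # columns to sum; 'note' is left out

# .SDcols limits .SD to value.cols, so the character column is never summed
res <- toy[, lapply(.SD, sum), by = group.cols, .SDcols = value.cols]
print(res)
```

Holding the column names in character vectors keeps the call independent of how the real columns are named.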

2012-08-15 11:38:29
