For each group, find the observation with the maximum value in several columns

Suppose I have a data frame like this:

set.seed(4) 
df<-data.frame(
    group = rep(1:10, each=3), 
    id = rep(sample(1:3), 10), 
    x = sample(c(rep(0, 15), runif(15))), 
    y = sample(c(rep(0, 15), runif(15))), 
    z = sample(c(rep(0, 15), runif(15))) 
)

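As you can see above, the x, y and z vectors take the value zero for some elements, and the rest are drawn from a uniform distribution between 0 and 1.

For each group, as identified by the first column, I want to find the three ids from the second column that point to the rows where x, y and z reach their highest value within the group. Assume there are no ties, except when a variable is 0 for every observation in a given group. In that case I don't want any number returned as the id of the row with the maximum value.

The output would look like this: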

group x y z 
    1 2 2 1 
    2 2 3 1 
... .........

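My first thought was to select the rows with the highest value separately for each variable and then use merge to put them into one table. However, I wonder whether it can be done without merge, for example with standard dplyr functions.

Source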

2017-08-03 Jean Broc

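With 'data.table' you can try 'setDT(df)[, lapply(.SD, function(x) id[which.max(x)]), by = group, .SDcols = c("x", "y", "z")]' – nicola

When using functions such as 'sample' and 'runif', please use 'set.seed'. You can try 'library(dplyr); df %>% group_by(group) %>% summarise_at(vars(-id), funs(which.max))' – Sotos

Does the first row '1 5 2 4' in the expected output mean that group 1 has its highest value at id 5 in column x, at id 2 in column y, and at id 4 in column z? If so, are you expecting only 2 rows of output? Then why the continuation dots in the expected output? – Aramis7d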
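As a rough sketch of how nicola's data.table suggestion could be extended to return NA for an all-zero column, something like the following might work. The `toy` data frame is made up for illustration (it is not the OP's data), and the all-zero check is an addition that is not part of the original comment.

```r
library(data.table)

# Made-up example data: in group 2, y is zero for every row.
toy <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  id    = c(3L, 1L, 2L, 3L, 1L, 2L),
  x     = c(0, 0.7, 0.2, 0.5, 0, 0),
  y     = c(0.9, 0, 0.1, 0, 0, 0)
)

dt <- as.data.table(toy)

# For each group and each column in .SDcols, take the id at the position
# of the maximum, or NA when the whole column is zero within the group.
res <- dt[, lapply(.SD, function(v) {
  if (all(v == 0)) NA_integer_ else id[which.max(v)]
}), by = group, .SDcols = c("x", "y")]

res
```

Using `NA_integer_` rather than plain `NA` keeps the column type the same across groups, which data.table checks when it combines the per-group results.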

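Here is a solution I came up with using plyr:

library(plyr)

# For each group, return the id of the row holding the maximum of each
# value column, or NA when the column is all zero within the group
ddply(df, .variables = c("group"), .fun = function(t) {
    apply(X = t[, c(-1, -2)], MARGIN = 2, function(z) {
        ifelse(sum(abs(z)) == 0, yes = NA, no = t$id[which.max(z)])
    })
})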

# group x y z 
#1  1 2 2 1 
#2  2 2 3 1 
#3  3 1 3 2 
#4  4 3 3 1 
#5  5 2 3 NA 
#6  6 3 1 3 
#7  7 1 1 2 
#8  8 NA 2 3 
#9  9 2 1 3 
#10 10 2 NA 2
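For comparison, the same per-group rule can be sketched in base R without any packages. The helper name `id_of_max` and the `toy` data are assumptions made up for this illustration; this is not the code from the answer above.

```r
# Hypothetical helper: id of the row where `v` peaks, or NA if `v` is all zero.
id_of_max <- function(v, id) {
  if (all(v == 0)) NA_integer_ else id[which.max(v)]
}

# Made-up example data: in group 2, y is zero for every row.
toy <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  id    = c(3L, 1L, 2L, 3L, 1L, 2L),
  x     = c(0, 0.7, 0.2, 0.5, 0, 0),
  y     = c(0.9, 0, 0.1, 0, 0, 0)
)

# Split by group, apply the helper to each value column, then bind the rows.
res <- do.call(rbind, lapply(split(toy, toy$group), function(g) {
  data.frame(
    group = g$group[1],
    x = id_of_max(g$x, g$id),
    y = id_of_max(g$y, g$id)
  )
}))

res
```

With this toy data, group 1 should get id 1 for x and id 3 for y, and group 2 should get id 3 for x and NA for y.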

Source

2017-08-03 10:06:20 TUSHAr

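A solution using dplyr and tidyr. Note that if all the numbers in a group are the same, we cannot decide which id should be chosen. That is why filter(n_distinct(Value) > 1) is added, to remove those records. In the final output df2, NA indicates the cases where all the numbers are the same. If we want, we can decide later how to fill in those NA values. This solution works for any number of ids or columns (x, y, z, ...).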

library(dplyr) 
library(tidyr) 

df2 <- df %>% 
    gather(Column, Value, -group, -id) %>% 
    arrange(group, Column, desc(Value)) %>% 
    group_by(group, Column) %>% 
    # If all values from a group-Column are all the same, remove that group-Column 
    filter(n_distinct(Value) > 1) %>% 
    slice(1) %>% 
    select(-Value) %>% 
    spread(Column, id)
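In more recent tidyr releases, gather and spread are superseded by pivot_longer and pivot_wider. A possible equivalent of the pipeline above, assuming tidyr 1.0 or later and dplyr 1.0 or later (for slice_max), might look like this. The `toy` data frame is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up example data: in group 2, y is zero for every row.
toy <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  id    = c(3L, 1L, 2L, 3L, 1L, 2L),
  x     = c(0, 0.7, 0.2, 0.5, 0, 0),
  y     = c(0.9, 0, 0.1, 0, 0, 0)
)

res <- toy %>%
  # Long format: one row per (group, id, Column)
  pivot_longer(c(-group, -id), names_to = "Column", values_to = "Value") %>%
  group_by(group, Column) %>%
  # Drop group-Column pairs where every value is the same
  filter(n_distinct(Value) > 1) %>%
  # Keep the single row with the largest value
  slice_max(Value, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-Value) %>%
  # Back to wide format; dropped pairs come back as NA
  pivot_wider(names_from = Column, values_from = id)

res
```

The group-Column pairs removed by the filter step have no row to spread, so pivot_wider fills them with NA, which matches the behaviour described for df2.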

Source

2017-08-03 12:58:53 www

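@docendodiscimus Thanks for your suggestion. I replaced that line with 'filter(length(unique(Value)) > 1)'. – www

@docendodiscimus By the way, the only case I can think of where 'filter(Value != mean(Value))' fails is when there are missing values in the original 'df'. But checking whether there is only one unique value is still better. – www

@docendodiscimus Thanks. Then please check my current solution to see whether it works on the OP's data. – www

If you want to stick to dplyr only, you can use the multi-column summarize/mutate functions. This should work regardless of the form of id; my initial attempt was slightly cleaner but assumed that an id of zero was invalid.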

library(dplyr)

df %>%
    group_by(group) %>%
    mutate_at(vars(-id),
              # If the row is the max within the group, set the value
              # to the id and use NA otherwise
              funs(ifelse(max(.) != 0 & . == max(.),
                          id,
                          NA))) %>%
    select(-id) %>%
    summarize_all(funs(
        # There are zero or one non-NA values per group, so handle both cases
        if (any(!is.na(.)))
            na.omit(.) else NA))
## # A tibble: 10 x 4 
## group  x  y  z 
## <int> <int> <int> <int> 
## 1  1  2  2  1 
## 2  2  2  3  1 
## 3  3  1  3  2 
## 4  4  3  3  1 
## 5  5  2  3 NA 
## 6  6  3  1  3 
## 7  7  1  1  2 
## 8  8 NA  2  3 
## 9  9  2  1  3 
## 10 10  2 NA  2
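Since funs() is deprecated as of dplyr 0.8.0, and the scoped verbs such as mutate_at and summarize_all are superseded by across() in dplyr 1.0.0, the same idea could be sketched with across() as below. This assumes dplyr 1.0 or later and uses a made-up `toy` data frame; it is an alternative formulation, not the code from the answer above.

```r
library(dplyr)

# Made-up example data: in group 2, y is zero for every row.
toy <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  id    = c(3L, 1L, 2L, 3L, 1L, 2L),
  x     = c(0, 0.7, 0.2, 0.5, 0, 0),
  y     = c(0.9, 0, 0.1, 0, 0, 0)
)

res <- toy %>%
  group_by(group) %>%
  # For each value column: the id at the maximum, or NA if the column
  # is all zero within the group
  summarise(across(c(x, y),
                   ~ if (all(.x == 0)) NA_integer_ else id[which.max(.x)]))

res
```

Because the lambda indexes id directly, the result holds actual id values rather than row positions, and it collapses each group to one row in a single step.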

Source

2017-08-03 14:21:16

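The output generated by your solution differs from the expected one. You may need to sort by id before running the code. – www

@ycw, I think this is due to Jean's edit, which changed the order of the ids. In the first group, id 2 now holds the maximum for x and y, and id 1 holds the maximum for z. –

Add 'arrange(id)' after 'group_by(group)' and you get the same result as the expected one. But note that your solution reports row indices, not the actual 'id' numbers. – www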
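The difference between a row index and an id mentioned in the last comment can be seen in a small base R sketch. The vectors here are made up for illustration.

```r
# Made-up values for a single group; ids are deliberately not in order 1, 2, 3.
id <- c(3L, 1L, 2L)
x  <- c(0, 0.7, 0.2)

pos    <- which.max(x)   # position of the maximum within the group: 2
max_id <- id[pos]        # the id stored at that position: 1

# which.max() on an all-zero vector still returns position 1,
# so the all-zero case has to be checked separately.
zero_pos <- which.max(c(0, 0, 0))
```

The two results only coincide when the ids within each group happen to be sorted as 1, 2, 3, which is why sorting by id first makes the index-based output match.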
