2015-07-19 108 views
1

Group by in R: average of the number of entries with a specific ID?

Suppose I have a dataset that looks like this, and I am working in R:

player        at_bat  opponent_name       game  result
Torri_Hunter  1       Pittsburgh Pirates  1     home run
Torri_Hunter  2       Pittsburgh Pirates  1     triple
Torri_Hunter  3       Pittsburgh Pirates  1     strikeout
Torri_Hunter  4       Pittsburgh Pirates  1     strikeout
Torri_Hunter  1       Pittsburgh Pirates  2     groundout
Torri_Hunter  2       Pittsburgh Pirates  2     home run
Torri_Hunter  3       Pittsburgh Pirates  2     flyout
Torri_Hunter  1       Pittsburgh Pirates  2     home run
Torri_Hunter  2       Pittsburgh Pirates  3     triple
Torri_Hunter  3       Pittsburgh Pirates  3     strikeout
Torri_Hunter  4       Pittsburgh Pirates  3     strikeout
Torri_Hunter  1       Detroit Tigers      1     home run
Torri_Hunter  2       Detroit Tigers      1     home run
Torri_Hunter  3       Detroit Tigers      1     home run
Torri_Hunter  4       Detroit Tigers      1     strikeout

(I know Torii's name is misspelled; bear with me here.)

What I ultimately want is the percentage of home runs per game within a series, ending up with something that looks like this:

              opponent_name       game_1s  game_2s  game_3s
Torri Hunter  Pittsburgh Pirates  25%      50%      0%
Torri Hunter  Detroit Tigers      75%      --       --

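I can use dplyr::filter() to narrow down the results, tally() the counts for each game ID, and then export to a .csv so I can get the averages in Excel (which is what I have been doing), but I would like to do this entirely in R. Any ideas?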

Answers

4

You could do:

library(dplyr) 
df %>% 
    group_by(player, opponent_name, game) %>% 
    summarise(p = sum(result == "home run")/n()) 

Which gives:

#Source: local data frame [4 x 4] 
#Groups: player, opponent_name 
# 
#  player  opponent_name game p 
#1 Torri_Hunter  Detroit Tigers 1 0.75 
#2 Torri_Hunter Pittsburgh Pirates 1 0.25 
#3 Torri_Hunter Pittsburgh Pirates 2 0.50 
#4 Torri_Hunter Pittsburgh Pirates 3 0.00 

To match the desired output, you could also do:

df %>% 
    group_by(player, opponent_name, game) %>% 
    summarise(p = mean(result == "home run")) %>% 
    tidyr::spread(game, p) %>% 
    arrange(desc(opponent_name)) %>% 
    setNames(c(names(.)[1:2], paste0("game_", names(.)[3:5], "s"))) %>% 
    mutate_each(funs(ifelse(is.na(.), "--", paste0(. * 100, "%"))), -(player:opponent_name)) 

Which gives:

#Source: local data frame [2 x 5] 
# 
#  player  opponent_name game_1s game_2s game_3s 
#1 Torri_Hunter Pittsburgh Pirates  25%  50%  0% 
#2 Torri_Hunter  Detroit Tigers  75%  --  -- 
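
For readers on current package versions: mutate_each() and funs() are deprecated in recent dplyr releases, and spread() has been superseded by pivot_wider() in tidyr. The following is only a sketch of the same pipeline, assuming dplyr 1.0 or later and a recent tidyr; the data frame is rebuilt from the question's table so the block runs on its own, and the name `out` is just for illustration.

```r
library(dplyr)
library(tidyr)

# The question's data, rebuilt so the example is self-contained
df <- data.frame(
  player = "Torri_Hunter",
  at_bat = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4),
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout"),
  stringsAsFactors = FALSE
)

out <- df %>%
  group_by(player, opponent_name, game) %>%
  # share of at-bats in each game that were home runs
  summarise(p = mean(result == "home run"), .groups = "drop") %>%
  # one column per game, named game_1s, game_2s, ...
  pivot_wider(names_from = game, values_from = p, names_glue = "game_{game}s") %>%
  arrange(desc(opponent_name)) %>%
  # show proportions as percentages and missing games as "--"
  mutate(across(starts_with("game_"),
                ~ ifelse(is.na(.x), "--", paste0(.x * 100, "%"))))

print(out)
```

The game columns end up as character strings, so this step is best kept for display; keep the numeric `p` if you need further calculations.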
+1

Yep. That's the one. Great, thanks! – skathan

+0

Why not use `mean` instead of sum(...)/n()? – Rentrop

+0

@Floo0 Sure, that works too.
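
To see why the two forms in the comments are interchangeable: R treats TRUE/FALSE as 1/0, so the mean of a logical vector is the share of TRUE values. A small sketch using the four at-bats of game 1 against Pittsburgh from the question (the variable names are only for illustration):

```r
# Results of the four at-bats in game 1 against Pittsburgh
result <- c("home run", "triple", "strikeout", "strikeout")

is_hr <- result == "home run"   # logical vector: TRUE FALSE FALSE FALSE

p_sum  <- sum(is_hr) / length(is_hr)  # count of TRUEs divided by number of at-bats
p_mean <- mean(is_hr)                 # same quantity in one call

print(c(p_sum, p_mean))
```

Inside summarise(), n() plays the role of length(is_hr) for the current group.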

0

How about writing a couple of functions to help you? Suppose your data frame is called df.

# Share of a player's at-bats in one game against one opponent that ended in `result`
perc_res <- function(opponent, game = 1, player = "Torri_Hunter", result = "home run") {
    at_bats <- df[df$player == player & df$opponent_name == opponent & df$game == game, ]
    sum(at_bats$result == result) / nrow(at_bats)
}

Then you can build the output data frame, which looks like:

out.df <- data.frame(Opponent = levels(factor(df$opponent_name)), Player = "Torri_Hunter",
                     stringsAsFactors = FALSE)
# sapply returns a numeric vector rather than a list column
out.df$game1s <- sapply(out.df$Opponent, perc_res, game = 1)

and so on for the other games. If you later want more players, you can use mapply.

P.S.: I haven't actually run the code, so there may still be some silly mistakes. But I think this should at least get you started!
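
The mapply idea is not spelled out in the answer, so here is one possible sketch of it. Everything in it is an assumption for illustration: the second player ("Player_B") and their results are made up, and this version of perc_res takes the player first and the data frame as an explicit `data` argument, which differs from the signature above.

```r
# Hypothetical data with the question's columns and two players
df <- data.frame(
  player = c(rep("Torri_Hunter", 4), rep("Player_B", 4)),
  opponent_name = "Detroit Tigers",
  game = 1,
  result = c("home run", "home run", "home run", "strikeout",
             "home run", "flyout", "strikeout", "triple"),
  stringsAsFactors = FALSE
)

# Share of at-bats ending in `result` for one player/opponent/game combination
perc_res <- function(player, opponent, game, result = "home run", data = df) {
  rows <- data$player == player & data$opponent_name == opponent & data$game == game
  mean(data$result[rows] == result)
}

# One row per player/opponent/game combination that occurs in the data
combos <- unique(df[, c("player", "opponent_name", "game")])

# mapply walks the three columns in parallel, calling perc_res once per row
combos$p <- mapply(perc_res, combos$player, combos$opponent_name, combos$game,
                   USE.NAMES = FALSE)

print(combos)
```

Because only combinations that actually occur are evaluated, no 0/0 cases arise for games a player did not play.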

2

A data.table solution, with casting, would be:

require(data.table) 
setDT(dat) 
percentage <- dat[,mean(result == "home run"), by = c("player", "opponent_name", "game")] 

Result:

> percentage 

     player  opponent_name game V1 
1: Torri_Hunter Pittsburgh Pirates 1 0.25 
2: Torri_Hunter Pittsburgh Pirates 2 0.50 
3: Torri_Hunter Pittsburgh Pirates 3 0.00 
4: Torri_Hunter  Detroit Tigers 1 0.75 

Casting it into the shape asked for in the question:

require(reshape2) 
dcast(percentage, player + opponent_name ~ game , value.var = "V1") 

Which gives the desired output:

 player  opponent_name 1 2 3 
1 Torri_Hunter  Detroit Tigers 0.75 NA NA 
2 Torri_Hunter Pittsburgh Pirates 0.25 0.5 0
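
The cast result still holds numeric proportions, NA for missing games, and columns named 1, 2, 3. If the exact formatting from the question is wanted, something along these lines might work; it is a sketch that uses data.table's own dcast method instead of loading reshape2, names the value column `p` instead of the default V1, and rebuilds `dat` from the question's table so it runs on its own.

```r
library(data.table)

# The question's data as a data.table
dat <- data.table(
  player = "Torri_Hunter",
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout")
)

# Home-run share per player, opponent and game, then one column per game
percentage <- dat[, .(p = mean(result == "home run")),
                  by = .(player, opponent_name, game)]
wide <- dcast(percentage, player + opponent_name ~ game, value.var = "p")

# Turn proportions into percentage strings and NA into "--"
game_cols <- setdiff(names(wide), c("player", "opponent_name"))
wide[, (game_cols) := lapply(.SD, function(x) ifelse(is.na(x), "--", paste0(x * 100, "%"))),
     .SDcols = game_cols]

# Rename 1, 2, 3 to game_1s, game_2s, game_3s
setnames(wide, game_cols, paste0("game_", game_cols, "s"))

print(wide)
```

Note that dcast sorts the rows by the left-hand-side columns, so Detroit Tigers comes before Pittsburgh Pirates here.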