2016-02-29

Calculating values above the 95th percentile with plyr

My data is structured as follows:

Individ <- data.frame(Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", 
             "Harry", "Harry", "Harry", "Harry","Harry", "Harry", "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"), 
         Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), 
         Condition = c("Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", 
            "Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr"), 
         Power = c(400, 250, 180, 500, 300, 450, 600, 512, 300, 500, 450, 200, 402, 210, 130, 520, 310, 451, 608, 582, 390, 570, NA, NA)) 

Using dplyr, I apply rolling averages (over windows of 2 to 4 seconds) with the following code:

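library(dplyr)
library(zoo)  # provides rollapply()

for (summaryFunction in c("mean")) {
  for (i in seq(2, 4, by = 1)) {
    # Right-aligned rolling summary of Power, computed within each Participant
    tempColumn <- Individ %>%
      group_by(Participant) %>%
      transmute(rollapply(Power,
                          width = i,
                          FUN = summaryFunction,
                          align = "right",
                          fill = NA,
                          na.rm = TRUE))
    # Name the new column, e.g. "Rolling.mean.2", and append it to Individ
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    Individ <- bind_cols(Individ, tempColumn[2])
  }
}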
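To make the window arithmetic concrete, here is a minimal sketch of what a single pass of that loop computes, using only Bill's first four Power values from the data above. The variable names `power` and `rolling_2` are illustrative only.

```r
library(zoo)

# First four Power values for Bill, taken from the data above
power <- c(400, 250, 180, 500)

# Right-aligned window of width 2: each value is the mean of the current
# and the previous observation. The first position has no full window,
# so fill = NA leaves it as NA.
rolling_2 <- rollapply(power, width = 2, FUN = mean,
                       align = "right", fill = NA, na.rm = TRUE)

print(rolling_2)
# NA 325 215 340
```

Widths 3 and 4 work the same way, with two and three leading NA values respectively.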

I now want to find the top 5% of Power for each Participant within each rolling average. To calculate this, I use:

Output = ddply(Individ, .(Participant, Condition), summarise, 
      TwoSec <- Rolling.mean.2 > quantile(Rolling.mean.2 , 0.95, na.rm = TRUE)) 

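However, I end up with a column of TRUE/FALSE values. What I am after instead are the actual values in the top 5%. How can I do this? Also, is there a simpler way to loop over each rolling-average column and find the top 5% of each one by Participant and Condition?

Thank you!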

Does this help? http://stackoverflow.com/questions/19608618/r-percentile-calculations-on-subsets-of-data – 2016-02-29 05:54:09

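Yes, it is helpful, thank you for the link. However, how can I get all of the occurrences for each participant that are greater than the 95th percentile? I am not interested in the other quantiles. – user2716568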

If I understand your question correctly, with 'dplyr' you could do 'df%>%group_by(Participant)%>%filter(between(Power, ,1,na.rm = TRUE)))' – alistaire
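The expression in the comment above appears to have lost some of its arguments, so it is left as posted. As a separate minimal sketch of the same general idea (not the commenter's exact code), a grouped `filter()` can keep the rows whose value exceeds that group's own 95th percentile. The data frame `df` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not the asker's real data set
df <- data.frame(
  Participant = rep(c("A", "B"), each = 5),
  Power = c(10, 20, 30, 40, 50, 5, 15, 25, 35, 100)
)

# Within each Participant, keep only rows above that participant's
# 95th percentile; rows where Power is NA are dropped by filter()
top5 <- df %>%
  group_by(Participant) %>%
  filter(Power > quantile(Power, 0.95, na.rm = TRUE)) %>%
  ungroup()

print(top5)
```

This returns the actual values rather than a TRUE/FALSE column, because the logical comparison is used to select rows instead of being stored.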

1 Answer

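It's good that you already have your table of rolling data; that makes the job of calculating the quantiles easier.

Step 1: Group by Participant, Condition, and Location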

Individ <- data.frame(Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", 
             "Harry", "Harry", "Harry", "Harry","Harry", "Harry", "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"), 
         Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), 
         Condition = c("Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", 
            "Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr"), 
         Location = c("Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Home", "Home", "Home", "Home", 
            "Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Home", "Home", "Home", "Home"), 
         Power = c(400, 250, 180, 500, 300, 450, 600, 512, 300, 500, 450, 200, 402, 210, 130, 520, 310, 451, 608, 582, 390, 570, NA, NA)) 


library(dplyr)
library(zoo)  # provides rollapply()

for (summaryFunction in c("mean")) {
  for (i in seq(2, 4, by = 1)) {
    # Right-aligned rolling summary of Power, computed within each Participant
    tempColumn <- Individ %>%
      group_by(Participant) %>%
      transmute(rollapply(Power,
                          width = i,
                          FUN = summaryFunction,
                          align = "right",
                          fill = NA,
                          na.rm = TRUE))
    # Name the new column, e.g. "Rolling.mean.2", and append it to Individ
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    Individ <- bind_cols(Individ, tempColumn[2])
  }
}


Individ 


    Participant Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4 
     (fctr) (dbl) (fctr) (fctr) (dbl)   (dbl)   (dbl)   (dbl) 
1   Bill  1 Placebo  Home 400    NA    NA    NA 
2   Bill  2 Placebo  Home 250   325    NA    NA 
3   Bill  3 Placebo  Home 180   215  276.6667    NA 
4   Bill  4 Placebo  Home 500   340  310.0000   332.5 
5   Bill  1  Expr  Away 300   400  326.6667   307.5 
6   Bill  2  Expr  Away 450   375  416.6667   357.5 
7   Bill  3  Expr  Away 600   525  450.0000   462.5 
8   Bill  4  Expr  Away 512   556  520.6667   465.5 
9   Bill  1  Expr  Home 300   406  470.6667   465.5 
10  Bill  2  Expr  Home 500   400  437.3333   478.0 

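Keep all 7 or 8 columns (this data set includes Location, so it also answers your other question). With the new Individ data set in hand, here is what I did to solve your problem. I'm 100% sure there is a cleaner and more efficient way to do this, but the logic is here and the output should be fine.

Step 2: Get the quantiles by group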

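library(plyr)  # loaded after dplyr, so plyr's summarise() masks dplyr's
# Replace every NA with 0 so that quantile() does not stop on missing values;
# note that these zeros are then counted in the quantile calculation.
Individ[is.na(Individ)] <- 0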
Top_percentiles <- ddply(Individ, 
         c("Participant", "Condition", "Location"), 
         summarise, 
         Power2 = quantile(Rolling.mean.2, .95), 
         Power3 = quantile(Rolling.mean.3, .95), 
         Power4 = quantile(Rolling.mean.4, .95) 
         ) 

Top_percentiles 

    Participant Condition Location Power2 Power3 Power4 
1  Bill  Expr  Away 551.350 510.0667 465.050 
2  Bill  Expr  Home 464.650 465.6667 476.125 
3  Bill Placebo  Home 337.750 305.0000 282.625 
4  Harry  Expr  Away 585.175 533.4000 485.425 
5  Harry Placebo  Home 322.150 280.7667 268.175 
6  Paul  Expr  Home 556.500 556.5000 408.000 

These are the top-5% thresholds for each group and the corresponding rolling average.
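Filling the missing values with zero is one way to keep `quantile()` from failing. A possible alternative, sketched here under the assumption that missing rolling means should be ignored rather than counted as zero, is to pass `na.rm = TRUE`. The small `rolled` data frame is built from Bill's Placebo rows shown earlier; `across()` needs dplyr 1.0.0 or later, and `dplyr::summarise` is called explicitly in case plyr is also attached.

```r
library(dplyr)

# Bill's four Placebo/Home rows, with the leading NAs left in place
rolled <- data.frame(
  Participant = "Bill",
  Condition = "Placebo",
  Location = "Home",
  Rolling.mean.2 = c(NA, 325, 215, 340),
  Rolling.mean.3 = c(NA, NA, 830 / 3, 310)
)

# One 95th-percentile threshold per group and per rolling column,
# ignoring missing values instead of turning them into zeros
thresholds <- rolled %>%
  group_by(Participant, Condition, Location) %>%
  dplyr::summarise(
    across(starts_with("Rolling.mean"),
           ~ unname(quantile(.x, 0.95, na.rm = TRUE))),
    .groups = "drop"
  )

print(thresholds)
# Rolling.mean.2 threshold: 338.5 (the zero-filled version gives 337.75)
```

The difference is small here, but it grows with the share of missing values in a group, so which treatment is appropriate depends on what a missing rolling mean is meant to represent.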

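Now the only thing left to do is to find the observations in the data set that are above each threshold.

Step 3: Match the rolling-average thresholds to the original data set

Something like this is roughly what I was fiddling around with.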

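# Caution: `&&` is not element-wise. In older versions of R it looks only at the
# first element of each match() result, so the index collapses to a single TRUE
# and the six thresholds are recycled down the rows instead of being looked up
# per group. R 4.3.0 and later raise an error for operands longer than one.
Individ$Power2 <- Top_percentiles$Power2[match(Individ$Participant, Top_percentiles$Participant) &&
                                         match(Individ$Condition, Top_percentiles$Condition) &&
                                         match(Individ$Location, Top_percentiles$Location)]

Individ$Power3 <- Top_percentiles$Power3[match(Individ$Participant, Top_percentiles$Participant) &&
                                         match(Individ$Condition, Top_percentiles$Condition) &&
                                         match(Individ$Location, Top_percentiles$Location)]

Individ$Power4 <- Top_percentiles$Power4[match(Individ$Participant, Top_percentiles$Participant) &&
                                         match(Individ$Condition, Top_percentiles$Condition) &&
                                         match(Individ$Location, Top_percentiles$Location)]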


Individ 


    Participant Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4 Power2 Power3 
     (fctr) (dbl) (fctr) (fctr) (dbl)   (dbl)   (dbl)   (dbl) (dbl) (dbl) 
1   Bill  1 Placebo  Home 400    0   0.0000   0.0 551.350 510.0667 
2   Bill  2 Placebo  Home 250   325   0.0000   0.0 464.650 465.6667 
3   Bill  3 Placebo  Home 180   215  276.6667   0.0 337.750 305.0000 
4   Bill  4 Placebo  Home 500   340  310.0000   332.5 585.175 533.4000 
5   Bill  1  Expr  Away 300   400  326.6667   307.5 322.150 280.7667 
6   Bill  2  Expr  Away 450   375  416.6667   357.5 556.500 556.5000 
7   Bill  3  Expr  Away 600   525  450.0000   462.5 551.350 510.0667 
8   Bill  4  Expr  Away 512   556  520.6667   465.5 464.650 465.6667 
9   Bill  1  Expr  Home 300   406  470.6667   465.5 337.750 305.0000 
10  Bill  2  Expr  Home 500   400  437.3333   478.0 585.175 533.4000 

My idea was to match the quantile columns onto the Individ data set.
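One way to realise this idea is a join on the three grouping columns, which attaches each group's threshold to every one of its rows regardless of row order or the size of the data set. Because `&&` compares single values only, a per-row lookup needs a vectorised tool such as this. The following is a sketch with a handful of values copied from the tables above, not a drop-in replacement tested on the full data; `obs` and `thresholds` are illustrative names.

```r
library(dplyr)

# A few observations, taken from the rolling-mean table above
obs <- data.frame(
  Participant = c("Bill", "Bill", "Bill", "Harry"),
  Condition = c("Expr", "Expr", "Placebo", "Expr"),
  Location = c("Away", "Away", "Home", "Away"),
  Rolling.mean.2 = c(525, 556, 340, 595),
  stringsAsFactors = FALSE
)

# Matching rows of the Top_percentiles table above (Power2 only)
thresholds <- data.frame(
  Participant = c("Bill", "Bill", "Harry"),
  Condition = c("Expr", "Placebo", "Expr"),
  Location = c("Away", "Home", "Away"),
  Power2 = c(551.350, 337.750, 585.175),
  stringsAsFactors = FALSE
)

# Every observation receives the threshold of its own group
matched <- left_join(obs, thresholds,
                     by = c("Participant", "Condition", "Location"))

print(matched)
```

The result always has as many rows as `obs`, so assigning or filtering afterwards cannot run into a length mismatch between the thresholds and the data.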

Step 4: Filter the data set

This should get you what you want.

Option 1: Three separate data sets

top_percentile_2sec <- Individ %>% filter(Rolling.mean.2 >= Power2) 
top_percentile_3sec <- Individ %>% filter(Rolling.mean.3 >= Power3) 
top_percentile_4sec <- Individ %>% filter(Rolling.mean.4 >= Power4) 

Option 2: One large combined data set

top_percentile_all_times <- Individ %>% filter(Rolling.mean.2 >= Power2 | Rolling.mean.3 >= Power3 | Rolling.mean.4 >= Power4) 


top_percentile_all_times 

Participant Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4 Power2 Power3 
     (fctr) (dbl) (fctr) (fctr) (dbl)   (dbl)   (dbl)   (dbl) (dbl) (dbl) 
1  Bill  1  Expr  Away 300   400.0  326.6667   307.50 322.15 280.7667 
2  Bill  4  Expr  Away 512   556.0  520.6667   465.50 464.65 465.6667 
3  Bill  1  Expr  Home 300   406.0  470.6667   465.50 337.75 305.0000 
4  Bill  3  Expr  Home 450   475.0  416.6667   440.50 322.15 280.7667 
5  Harry  1  Expr  Away 310   415.0  320.0000   292.50 322.15 280.7667 
6  Harry  3  Expr  Away 608   529.5  456.3333   472.25 551.35 510.0667 
7  Harry  4  Expr  Away 582   595.0  547.0000   487.75 464.65 465.6667 
8  Paul  3  Expr  Home  0   570.0  480.0000   0.00 322.15 280.7667 
9  Paul  4  Expr  Home  0   0.0  570.0000   480.00 556.50 556.5000 
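On the question of looping over every rolling-mean column, one possible approach (a sketch, not taken from the posts above) is to reshape the rolling columns into long format, so that a single grouped `filter()` handles all window widths at once. The `rolled` data frame holds a few values from the tables above with the leading NAs left in place; `window` and `value` are made-up column names, and `pivot_longer()` requires tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Rolling means for the first four rows of Bill and Harry
rolled <- data.frame(
  Participant = rep(c("Bill", "Harry"), each = 4),
  Rolling.mean.2 = c(NA, 325, 215, 340, NA, 306, 170, 325),
  Rolling.mean.3 = c(NA, NA, 276.6667, 310, NA, NA, 247.3333, 286.6667)
)

top_values <- rolled %>%
  # One row per observation and window width
  pivot_longer(starts_with("Rolling.mean"),
               names_to = "window", values_to = "value") %>%
  # Threshold is computed separately per participant and window width
  group_by(Participant, window) %>%
  filter(value >= quantile(value, 0.95, na.rm = TRUE)) %>%
  ungroup()

print(top_values)
```

Further grouping columns such as Condition or Location can be added to `group_by()` in the same way, and adding a new window width needs no extra code as long as the column name starts with `Rolling.mean`.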

Here is a link that helped me a great deal.

how to calculate 95th percentile of values with grouping variable in R or Excel

Does this also solve your question from the other post?

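Thank you for taking the time to put together an answer to my question, I really appreciate it! When I run your code on a larger data frame (988,841 obs), step 3 returns the following error: '`$<-.data.frame`(`*tmp*`, "Power1", value = c(1.8886312245, : replacement has 11 rows, data has 988841' – user2716568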

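If it doesn't involve any confidential information, could you provide a larger dummy data set? It's hard for me to diagnose that error unless I can see what happens at each step. Posting screenshots of what happens after each step would help me or others solve this. It could also be a syntax error or a typo on your part or mine, so please check this carefully. – InfiniteFlashChess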

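Unfortunately I can't provide the actual data set, as the data is confidential. I managed to get past the problem above, but only by matching on "Name" rather than "Location"; with that change, your code gave me exactly what I was after. As it turned out, I had subset the different locations from the start, which was a little cumbersome but works for the analysis (the end goal is to compare locations). Thank you very much for your help and support, I greatly appreciate it! – user2716568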