2017-03-09 78 views
2

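I want to split a data frame by multiple columns so that I can see the `summary()` output for each subset of the data. What is the **tidyverse** way to split a data frame by multiple columns?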

Here is one way to do it using `split()` from base R:

library(tidyverse) 
#> Loading tidyverse: ggplot2 
#> Loading tidyverse: tibble 
#> Loading tidyverse: tidyr 
#> Loading tidyverse: readr 
#> Loading tidyverse: purrr 
#> Loading tidyverse: dplyr 
#> Conflicts with tidy packages ---------------------------------------------- 
#> filter(): dplyr, stats 
#> lag(): dplyr, stats 

mtcars %>% 
    select(1:3) %>% 
    mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE), 
     GRP_B = sample(c(1:2), n(), replace = TRUE)) %>% 
    split(list(.$GRP_A, .$GRP_B)) %>% 
    map(summary) 
#> $A.1 
#>  mpg    cyl   disp   GRP_A   
#> Min. :10.40 Min. :4.0 Min. :108.0 Length:10   
#> 1st Qu.:14.97 1st Qu.:4.5 1st Qu.:151.9 Class :character 
#> Median :18.50 Median :7.0 Median :259.3 Mode :character 
#> Mean :17.61 Mean :6.4 Mean :283.4      
#> 3rd Qu.:20.85 3rd Qu.:8.0 3rd Qu.:430.0      
#> Max. :24.40 Max. :8.0 Max. :472.0      
#>  GRP_B 
#> Min. :1 
#> 1st Qu.:1 
#> Median :1 
#> Mean :1 
#> 3rd Qu.:1 
#> Max. :1 
#> 
#> $B.1 
#>  mpg    cyl   disp   GRP_A   
#> Min. :15.00 Min. :4.0 Min. : 75.7 Length:5   
#> 1st Qu.:21.00 1st Qu.:4.0 1st Qu.: 78.7 Class :character 
#> Median :21.50 Median :4.0 Median :120.1 Mode :character 
#> Mean :24.06 Mean :5.2 Mean :147.1      
#> 3rd Qu.:30.40 3rd Qu.:6.0 3rd Qu.:160.0      
#> Max. :32.40 Max. :8.0 Max. :301.0      
#>  GRP_B 
#> Min. :1 
#> 1st Qu.:1 
#> Median :1 
#> Mean :1 
#> 3rd Qu.:1 
#> Max. :1 
#> 
#> $A.2 
#>  mpg    cyl    disp   GRP_A   
#> Min. :15.20 Min. :4.000 Min. : 95.1 Length:9   
#> 1st Qu.:16.40 1st Qu.:6.000 1st Qu.:160.0 Class :character 
#> Median :18.10 Median :8.000 Median :275.8 Mode :character 
#> Mean :19.84 Mean :6.667 Mean :234.0      
#> 3rd Qu.:21.00 3rd Qu.:8.000 3rd Qu.:275.8      
#> Max. :30.40 Max. :8.000 Max. :360.0      
#>  GRP_B 
#> Min. :2 
#> 1st Qu.:2 
#> Median :2 
#> Mean :2 
#> 3rd Qu.:2 
#> Max. :2 
#> 
#> $B.2 
#>  mpg    cyl   disp   GRP_A   
#> Min. :13.30 Min. :4 Min. : 71.1 Length:8   
#> 1st Qu.:14.97 1st Qu.:4 1st Qu.:125.3 Class :character 
#> Median :20.55 Median :6 Median :201.5 Mode :character 
#> Mean :20.99 Mean :6 Mean :213.5      
#> 3rd Qu.:23.93 3rd Qu.:8 3rd Qu.:315.5      
#> Max. :33.90 Max. :8 Max. :360.0      
#>  GRP_B 
#> Min. :2 
#> 1st Qu.:2 
#> Median :2 
#> Mean :2 
#> 3rd Qu.:2 
#> Max. :2 

How can I achieve the same result using tidyverse verbs? My first thought was to use `purrr::by_slice()`, but apparently it has been deprecated.

+0

Is there a reason you can't use `split`? Do you want an explicit split, or would `group_by` work?

+1

I try to avoid mixing R "dialects", so `.$GRP_A` is not to my taste. `group_by` doesn't help here: it returns a grouped data frame, but `summary()` does not recognize the groups. – Tiernan
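To illustrate the point about `group_by`, here is a minimal sketch. It uses `cyl` as a stand-in grouping column (an assumption for the example, not the columns from the question) and relies on `summary()` having no method specific to grouped data frames, which is the case as far as I know.

```r
library(dplyr)

grouped <- mtcars %>%
  select(mpg, cyl) %>%
  group_by(cyl)   # cyl is only a stand-in grouping column for this sketch

# summary() falls back to the data-frame method here, so it reports one
# summary per column across all rows instead of one per group
grouped_summary <- summary(grouped)
grouped_summary
```

The result is a single table covering both columns, with no per-group breakdown.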

+0

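Now that I've typed it out, my preference for sticking to the `tidyverse` "dialect" is probably needlessly picky... but given the choice, I would use a `tidyverse` verb in place of `split` every time, so I just thought I'd see whether there was something I had overlooked. – Tiernan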

Answer

2

The "tidy" solution appears to be the combination of "mutate + list-cols + purrr", according to Hadley.


library(tidyverse) 
library(magrittr) 

# group, nest, create a new col leveraging purrr::map() 
mt_summary <- 
    mtcars %>% 
    select(1:3) %>% 
    mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE), 
      GRP_B = sample(c(1:2), n(), replace = TRUE)) %>% 
    group_by(GRP_A, GRP_B) %>% 
    nest() %>% 
    mutate(SUMMARY = map(data, .f = summary)) 

# check the structure 
mt_summary 
#> # A tibble: 4 × 4 
#> GRP_A GRP_B    data  SUMMARY 
#> <chr> <int>   <list>  <list> 
#> 1  A  1 <tibble [11 × 3]> <S3: table> 
#> 2  B  2 <tibble [9 × 3]> <S3: table> 
#> 3  A  2 <tibble [7 × 3]> <S3: table> 
#> 4  B  1 <tibble [5 × 3]> <S3: table> 

# extract the summaries 
extract2(mt_summary, "SUMMARY") %>% 
    set_names(paste0(extract2(mt_summary, "GRP_A"), 
        extract2(mt_summary, "GRP_B"))) 
#> $A1 
#>  mpg    cyl    disp  
#> Min. :10.40 Min. :4.000 Min. : 75.7 
#> 1st Qu.:15.25 1st Qu.:4.000 1st Qu.:120.9 
#> Median :19.20 Median :6.000 Median :167.6 
#> Mean :20.43 Mean :6.182 Mean :229.0 
#> 3rd Qu.:25.85 3rd Qu.:8.000 3rd Qu.:309.5 
#> Max. :30.40 Max. :8.000 Max. :460.0 
#> 
#> $B2 
#>  mpg    cyl    disp  
#> Min. :15.20 Min. :4.000 Min. : 78.7 
#> 1st Qu.:17.80 1st Qu.:4.000 1st Qu.:120.3 
#> Median :19.20 Median :6.000 Median :167.6 
#> Mean :20.84 Mean :6.222 Mean :225.9 
#> 3rd Qu.:21.50 3rd Qu.:8.000 3rd Qu.:351.0 
#> Max. :32.40 Max. :8.000 Max. :400.0 
#> 
#> $A2 
#>  mpg    cyl    disp  
#> Min. :15.20 Min. :4.000 Min. : 71.1 
#> 1st Qu.:18.90 1st Qu.:4.000 1st Qu.:114.5 
#> Median :21.40 Median :6.000 Median :145.0 
#> Mean :21.79 Mean :5.429 Mean :176.0 
#> 3rd Qu.:22.10 3rd Qu.:6.000 3rd Qu.:241.5 
#> Max. :33.90 Max. :8.000 Max. :304.0 
#> 
#> $B1 
#>  mpg    cyl   disp  
#> Min. :10.40 Min. :4.0 Min. :140.8 
#> 1st Qu.:13.30 1st Qu.:8.0 1st Qu.:275.8 
#> Median :14.30 Median :8.0 Median :350.0 
#> Mean :15.62 Mean :7.2 Mean :319.7 
#> 3rd Qu.:17.30 3rd Qu.:8.0 3rd Qu.:360.0 
#> Max. :22.80 Max. :8.0 Max. :472.0 
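
For readers on a newer dplyr, a shorter route may be `group_split()` together with `group_keys()`. This is a sketch under assumptions rather than part of the original answer: it assumes dplyr 0.8.0 or later, and it swaps the random `sample()` groups for deterministic ones so the result is reproducible.

```r
library(dplyr)
library(purrr)

grouped <- mtcars %>%
  select(1:3) %>%
  # Assumption: deterministic groups instead of sample(), for reproducibility
  mutate(GRP_A = rep(c("A", "B"), length.out = n()),
         GRP_B = rep(1:2, each = 2, length.out = n())) %>%
  group_by(GRP_A, GRP_B)

# group_keys() returns one row per group, in the same order as group_split()
group_names <- grouped %>%
  group_keys() %>%
  mutate(name = paste(GRP_A, GRP_B, sep = ".")) %>%
  pull(name)

summaries <- grouped %>%
  group_split(.keep = FALSE) %>%   # drop the grouping columns from each piece
  set_names(group_names) %>%
  map(summary)

summaries
```

This stays within tidyverse verbs and avoids the `.$GRP_A` idiom; the list names follow the same `A.1`, `B.2` pattern as the `split()` version, although the order of the elements differs.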
+0

I read [here](https://github.com/hadley/dplyr/issues/1118) that `do` will "go away", though it is not yet clear what that means. – aosmith
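
On the question of what might take the place of `do`: later dplyr releases added `group_map()`, which applies a function to each group and returns a list. The sketch below assumes dplyr 0.8.0 or later and uses deterministic stand-in groups (an assumption, not the random groups from the thread).

```r
library(dplyr)

summaries <- mtcars %>%
  select(1:3) %>%
  # Assumption: deterministic groups so the example is reproducible
  mutate(GRP_A = rep(c("A", "B"), length.out = n()),
         GRP_B = rep(1:2, each = 2, length.out = n())) %>%
  group_by(GRP_A, GRP_B) %>%
  # .x is the data for one group, without the grouping columns
  group_map(~ summary(.x))

length(summaries)
```

Unlike `split()`, the returned list is unnamed, so the group labels have to be attached separately if they are needed.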

+0

Hmm... I'll open an issue on the `purrr` GitHub repo and see if I can get one of the developers to weigh in on this. Thanks for the lead, @aosmith! – Tiernan

+0

It seems others are advocating for keeping `by_slice`: https://github.com/hadley/purrr/issues/270 – Tiernan