2017-01-30 98 views
0

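Descriptive statistics for grouped and ungrouped observations

I recently migrated from Stata to R. I am not sure how to compute descriptive statistics for grouped and ungrouped observations.

Here is my data: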

dput(DF) 
structure(list(Product_Name = c("iPhone", "iPhone", "iPhone", 
"iPhone", "iPhone", "iPhone", "Nexus 6P", "Nexus 6P", "Nexus 6P", 
"Nexus 6P", "Nexus 6P", "Nexus 6P"), Product_Type = c("New", 
"New", "Refurbished", "New", "New", "Refurbished", "Refurbished", 
"Refurbished", "Refurbished", "Refurbished", "Refurbished", "Refurbished" 
), Year = c(2006, 2011, 2009, 2008, 2011, 2009, 2012, 2007, 2013, 
2015, 2009, 2010), Units = c(100, 200, 300, 400, 500, 600, 700, 
200, 120, 125, 345, 340)), .Names = c("Product_Name", "Product_Type", 
"Year", "Units"), row.names = c(NA, 12L), class = "data.frame") 

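My data contains products sold by year and type. Each product is either refurbished or new. In addition, if a product was sold before 2010, I label it as sold in "Time 1"; otherwise I label it as sold in "Time 2".

Here is my code: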

DF[DF$Year<2010,"Time"]<-"1" 
DF[DF$Year>=2010,"Time"]<-"2" 
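For this data, the same labelling could also be written as one vectorised call. This is only a sketch on a small made-up `Year` column (the four values below are illustrative, not the full data set), assuming `Year` has no missing values:

```r
# Illustrative subset of years; not the full data from the question
DF <- data.frame(Year = c(2006, 2011, 2009, 2010))

# "1" for sales before 2010, "2" otherwise
DF$Time <- ifelse(DF$Year < 2010, "1", "2")
```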

Now I want to generate descriptive statistics for these time periods.

library(dplyr)

DF %>%
    group_by(Product_Name, Product_Type, Time) %>%
    dplyr::summarise(Count = n(),
        Sum_Units = sum(Units, na.rm = TRUE),
        Avg_Units = mean(Units, na.rm = TRUE),
        Max_Units = max(Units, na.rm = TRUE))

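If we run the code above, we get descriptive statistics by Product_Name, Product_Type and Time (i.e. grouped descriptive statistics). However, this is not all I want. I also need descriptive statistics that do not take the grouping by Product_Type and Time into account. That is, I want to compute descriptive statistics assuming the product was sold in either Time 1 or Time 2 (i.e. across all years) and irrespective of the type of product sold, while still keeping some of the grouped information above.

Expected output: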

dput(DFOut) 
structure(list(Product_Name = c("iPhone", "Nexus 6P"), New_Units_Sum_Time1 = c(500, 
NA), Refurbished_Units_Sum_Time_1 = c(900, 545), Sum_Units_Time1 = c(1400, 
545), Sum_Units_Time2 = c(700, 1285), Sum_Units_Time_1_And_2 = c(2100, 
1830), Avg_Units_Time1 = c(350, 272.5), Avg_Units_Time2 = c(350, 
321.25), Avg_Units_Time_1_And_2 = c(350, 305), Max_Units_Time1 = c(600, 
345), Max_Units_Time2 = c(500, 700), Max_Units_Time_1_And_2 = c(600, 
700)), .Names = c("Product_Name", "New_Units_Sum_Time1", "Refurbished_Units_Sum_Time_1", 
"Sum_Units_Time1", "Sum_Units_Time2", "Sum_Units_Time_1_And_2", 
"Avg_Units_Time1", "Avg_Units_Time2", "Avg_Units_Time_1_And_2", 
"Max_Units_Time1", "Max_Units_Time2", "Max_Units_Time_1_And_2" 
), row.names = 1:2, class = "data.frame") 

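In the output, you will see that I have several descriptive statistics:

a) Based on the type of product and the period in which it was sold (e.g. New_Units_Sum_Time1 for New and Time1). Note that in the output I have only shown the combination of New and Time1. It would be great if you could guide me on how to generate descriptive statistics for the other combinations of Refurbished and Time.

b) Based on ignoring the type of product, but not the period in which it was sold (e.g. Sum_Units_Time1 for Time1).

c) Based on ignoring both the type of product and the period in which it was sold (e.g. Sum_Units_Time_1_And_2).

The same applies to the maximum and the average.

How can I do this? I would appreciate any help. I am really struggling.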


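Please note that I created DFOut manually in Excel. Although I triple-checked it, there may be some manual errors. If anything looks wrong, I will be happy to clarify. Thank you for your time.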


sessionInfo()

R version 3.3.2 (2016-10-31) 
Platform: x86_64-w64-mingw32/x64 (64-bit) 
Running under: Windows >= 8 x64 (build 9200) 

locale: 
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C       
[5] LC_TIME=English_United States.1252  

attached base packages: 
[1] grDevices datasets stats  graphics grid  tcltk  utils  methods base  

other attached packages: 
[1] tables_0.8    Hmisc_4.0-2    Formula_1.2-1   survival_2.40-1   
[5] ResourceSelection_0.3-0 magrittr_1.5   stringr_1.1.0   bit64_0.9-5    
[9] bit_1.1-12    tufterhandout_1.2.1  knitr_1.15.1   rmarkdown_1.3   
[13] tufte_0.2    corrplot_0.77   purrr_0.2.2    readr_1.0.0    
[17] tibble_1.2    tidyverse_1.1.1   cowplot_0.7.0   plotly_4.5.6   
[21] ggplot2_2.2.1   maps_3.1.1    directlabels_2015.12.16 tidyr_0.6.1    
[25] ggthemes_3.3.0   R2HTML_2.3.2   lubridate_1.6.0   xts_0.9-7    
[29] zoo_1.7-14    lattice_0.20-34   corrgram_1.10   hexbin_1.27.1   
[33] sm_2.2-5.4    compare_0.2-6   installr_0.18.0   psych_1.6.12   
[37] reshape2_1.4.2   readstata13_0.8.5  pastecs_1.3-18   boot_1.3-18    
[41] vcd_1.4-3    car_2.1-4    xlsxjars_0.6.1   rJava_0.9-8    
[45] debug_1.3.1    dplyr_0.5.0    foreign_0.8-67   gmodels_2.16.2   
[49] openxlsx_4.0.0   plyr_1.8.4    

loaded via a namespace (and not attached): 
[1] minqa_1.2.4   colorspace_1.3-2 class_7.3-14  modeltools_0.2-21 mclust_5.2.2  
[6] rprojroot_1.2  htmlTable_1.9  base64enc_0.1-3  MatrixModels_0.4-1 flexmix_2.3-13  
[11] mvtnorm_1.0-5  xml2_1.1.1   codetools_0.2-15 splines_3.3.2  mnormt_1.5-5  
[16] robustbase_0.92-7 jsonlite_1.2  nloptr_1.0.4  pbkrtest_0.4-6  broom_0.4.1   
[21] cluster_2.0.5  kernlab_0.9-25  httr_1.2.1   backports_1.0.5  assertthat_0.1  
[26] Matrix_1.2-7.1  lazyeval_0.2.0  acepack_1.4.1  htmltools_0.3.5  quantreg_5.29  
[31] tools_3.3.2   gtable_0.2.0  Rcpp_0.12.9   trimcluster_0.1-2 gdata_2.17.0  
[36] nlme_3.1-128  iterators_1.0.8  fpc_2.1-10   lmtest_0.9-34  lme4_1.1-12   
[41] rvest_0.3.2   gtools_3.5.0  dendextend_1.4.0 DEoptimR_1.0-8  MASS_7.3-45   
[46] scales_0.4.1  TSP_1.1-4   hms_0.3    parallel_3.3.2  SparseM_1.74  
[51] RColorBrewer_1.1-2 gridExtra_2.2.1  rpart_4.1-10  latticeExtra_0.6-28 stringi_1.1.2  
[56] gclus_1.3.1   mvbutils_2.7.4.1 foreach_1.4.3  checkmate_1.8.2  seriation_1.2-1  
[61] caTools_1.17.1  prabclus_2.2-6  bitops_1.0-6  evaluate_0.10  htmlwidgets_0.8  
[66] R6_2.2.0   gplots_3.0.1  DBI_0.5-1   haven_1.0.0   whisker_0.3-2  
[71] mgcv_1.8-16   nnet_7.3-12   modelr_0.1.0  KernSmooth_2.23-15 viridis_0.3.4  
[76] readxl_0.1.1  data.table_1.10.0 forcats_0.2.0  digest_0.6.12  diptest_0.75-7  
[81] stats4_3.3.2  munsell_0.4.3  registry_0.3  viridisLite_0.1.3 quadprog_1.5-5  
+1

So basically you need different grouping variables each time...? A series of `group_by %>% ... %>% ungroup() ... %>% ... group_by ...`? – Sotos

+0

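@Sotos - Thanks again for your help. Yes, the only clarification I would add is that there are multiple levels: a) descriptive statistics grouped by product name, time and type; b) grouped by product name and time; c) grouped by product name. Does this help? – watchtower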

+1

Hmm, yes. That is exactly what I understood :) – Sotos

Answers
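The series of `group_by` calls discussed in these comments might look roughly like the sketch below. It assumes dplyr is installed and rebuilds the question's data so that it runs on its own. The helper `summarise_units` and the three result names are hypothetical, introduced only for illustration.

```r
library(dplyr)

# The question's data, rebuilt so the sketch is self-contained
DF <- data.frame(
  Product_Name = rep(c("iPhone", "Nexus 6P"), each = 6),
  Product_Type = c("New", "New", "Refurbished", "New", "New", "Refurbished",
                   rep("Refurbished", 6)),
  Year  = c(2006, 2011, 2009, 2008, 2011, 2009, 2012, 2007, 2013, 2015, 2009, 2010),
  Units = c(100, 200, 300, 400, 500, 600, 700, 200, 120, 125, 345, 340),
  stringsAsFactors = FALSE
)
DF$Time <- ifelse(DF$Year < 2010, "1", "2")

# Hypothetical helper: the same summaries for whatever grouping is in place
summarise_units <- function(grouped) {
  grouped %>%
    dplyr::summarise(Count     = n(),
                     Sum_Units = sum(Units, na.rm = TRUE),
                     Avg_Units = mean(Units, na.rm = TRUE),
                     Max_Units = max(Units, na.rm = TRUE)) %>%
    ungroup()
}

# Level a) product name, type and time
by_name_type_time <- DF %>% group_by(Product_Name, Product_Type, Time) %>% summarise_units()
# Level b) product name and time
by_name_time <- DF %>% group_by(Product_Name, Time) %>% summarise_units()
# Level c) product name only
by_name <- DF %>% group_by(Product_Name) %>% summarise_units()
```

Each result is a separate long-format table, one per grouping level. Combining them into a single wide table like DFOut would be a further step.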

2

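One way to automate this is to first build a vector (ind) holding all combinations of two or more of your grouping variables. We then convert each combination into a formula with Units on the left-hand side. The formulas are stored in a list (l1), so we iterate over that list and aggregate once per formula.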

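# df is the question's data frame (DF) with the Time column already added
# All combinations of two or three grouping variables, joined with '+'
ind <- unlist(sapply(c(2, 3), function(i)
    combn(c('Product_Name', 'Product_Type', 'Time'), i, paste, collapse = '+')))

# One formula per combination, e.g. Units ~ Product_Name+Time
l1 <- sapply(ind, function(i) as.formula(paste('Units ~ ', i)))

# Aggregate Units under each formula
lapply(l1, function(i) aggregate(i, df, FUN = function(j) c(sum1 = sum(j),
                                                            avg = mean(j),
                                                            max_units = max(j))))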

#which gives 

#$`Product_Name+Product_Type` 
# Product_Name Product_Type Units.sum1 Units.avg Units.max_units 
#1  iPhone   New  1200  300    500 
#2  iPhone Refurbished  900  450    600 
#3  Nexus 6P Refurbished  1830  305    700 

#$`Product_Name+Time` 
# Product_Name Time Units.sum1 Units.avg Units.max_units 
#1  iPhone 1 1400.00 350.00   600.00 
#2  Nexus 6P 1  545.00 272.50   345.00 
#3  iPhone 2  700.00 350.00   500.00 
#4  Nexus 6P 2 1285.00 321.25   700.00 

#$`Product_Type+Time` 
# Product_Type Time Units.sum1 Units.avg Units.max_units 
#1   New 1  500.00 250.00   400.00 
#2 Refurbished 1 1445.00 361.25   600.00 
#3   New 2  700.00 350.00   500.00 
#4 Refurbished 2 1285.00 321.25   700.00 

#$`Product_Name+Product_Type+Time` 
# Product_Name Product_Type Time Units.sum1 Units.avg Units.max_units 
#1  iPhone   New 1  500.00 250.00   400.00 
#2  iPhone Refurbished 1  900.00 450.00   600.00 
#3  Nexus 6P Refurbished 1  545.00 272.50   345.00 
#4  iPhone   New 2  700.00 350.00   500.00 
#5  Nexus 6P Refurbished 2 1285.00 321.25   700.00 
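The list above is in long format, while the DFOut requested in the question has one row per product. A minimal base R sketch of one way to get that layout follows. It is not part of the original answer. The helper `stats_by_product` and the generated column names (e.g. `Sum_Units_New_Time1`) are assumptions made for illustration, and they do not match every column name in DFOut.

```r
# The question's data, rebuilt so the sketch is self-contained
DF <- data.frame(
  Product_Name = rep(c("iPhone", "Nexus 6P"), each = 6),
  Product_Type = c("New", "New", "Refurbished", "New", "New", "Refurbished",
                   rep("Refurbished", 6)),
  Year  = c(2006, 2011, 2009, 2008, 2011, 2009, 2012, 2007, 2013, 2015, 2009, 2010),
  Units = c(100, 200, 300, 400, 500, 600, 700, 200, 120, 125, 345, 340),
  stringsAsFactors = FALSE
)
DF$Time <- ifelse(DF$Year < 2010, "1", "2")

# Hypothetical helper: sum, mean and max of Units per product for a subset d,
# with the suffix appended to each statistic's column name
stats_by_product <- function(d, suffix) {
  out <- aggregate(Units ~ Product_Name, data = d,
                   FUN = function(x) c(Sum = sum(x), Avg = mean(x), Max = max(x)))
  # aggregate() returns the statistics as a matrix column; flatten it
  out <- data.frame(Product_Name = out$Product_Name, out$Units,
                    stringsAsFactors = FALSE)
  names(out)[-1] <- paste0(names(out)[-1], "_Units_", suffix)
  out
}

# Levels b) and c): per time period, and over all years
pieces <- list(
  stats_by_product(DF[DF$Time == "1", ], "Time1"),
  stats_by_product(DF[DF$Time == "2", ], "Time2"),
  stats_by_product(DF, "Time_1_And_2")
)

# Level a): every observed combination of product type and time period
combos <- unique(DF[, c("Product_Type", "Time")])
by_type_time <- lapply(seq_len(nrow(combos)), function(k) {
  ty <- combos$Product_Type[k]
  tm <- combos$Time[k]
  stats_by_product(DF[DF$Product_Type == ty & DF$Time == tm, ],
                   paste0(ty, "_Time", tm))
})

# Full outer join on Product_Name; products absent from a subset get NA
wide <- Reduce(function(a, b) merge(a, b, by = "Product_Name", all = TRUE),
               c(pieces, by_type_time))
```

Because the pieces are merged with `all = TRUE`, a product that never appears in a given subset (for example a new Nexus 6P) shows up as NA rather than being dropped.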
+0

Thanks Sotos... I changed your `df` to `DF1`, and I get this error: `Error in eval(expr, envir, enclos): object 'Units' not found`. Here is the traceback: `10. eval(expr, envir, enclos) 9. eval(predvars, data, env) 8. model.frame.default(formula = i, data = DF1) 7. stats::model.frame(formula = i, data = DF1) 6. eval(expr, envir, enclos) 5. eval(m, parent.frame()) 4. aggregate.formula(i, DF1, FUN = function(j) c(sum1 = sum(j), avg = mean(j), max_units = max(j))) 3. aggregate(i, DF1, ... 2. FUN(X[[i]], ...) ... aggregate(i, DF1, ..` – watchtower

+0

Hmm... not sure. Strange, it works perfectly here. Is everything you are using up to date? – Sotos

+0

I have added my package list. I hope nothing is wrong with my R installation. – watchtower
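The thread does not establish what caused the error. The formula method of `aggregate` looks up each variable first in the data frame and then in the formula's environment, so `object 'Units' not found` means `Units` was found in neither. One generic check is to compare the names the formulas need against the columns actually present. The `DF1` below is a made-up stand-in with a deliberately misspelled column, not the commenter's real data.

```r
# Hypothetical stand-in for the commenter's DF1; note the lower-case "units"
DF1 <- data.frame(Product_Name = c("iPhone", "Nexus 6P"),
                  units = c(100, 200),
                  stringsAsFactors = FALSE)

# Columns the formulas in the answer rely on
needed <- c("Product_Name", "Product_Type", "Time", "Units")

# Any name listed here is missing from DF1 and would trigger the error
absent <- setdiff(needed, names(DF1))
absent
```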