2016-08-03
0

Summarize a data frame by date and group

I want to summarize a dataset by several different factors. Here is a sample of my data:

household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3") 
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9)) 
value<-c(1:9) 
type<-c("income","water","energy","income","water","energy","income","water","energy") 
df<-data.frame(household,date,value,type) 

   household       date value   type
1 household1 1999-05-10   100 income
2 household1 1999-05-25   200  water
3 household1 1999-10-12   300 energy
4 household2 1999-02-02   400 income
5 household2 1999-08-20   500  water
6 household2 1999-02-19   600 energy
7 household3 1999-07-01   700 income
8 household3 1999-10-13   800  water
9 household3 1999-01-01   900 energy

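I want to summarize the data by month. Ideally, the final dataset would have 12 rows per household (one per month) and one column per expense category (water, energy, income), each holding the sum of that category for the month.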

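I tried starting by adding a column with a short date. My plan was then to filter by each type and create a separate data frame with the summed data for each transaction type, and finally merge those data frames to get the summarized df. I tried to summarize with ddply, but it aggregates too much and I cannot keep the household-level information.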

library(plyr)
ddply(df, .(shortdate), summarize, mean_value = mean(value))
  shortdate mean_value
1     14/07   15.88235
2     14/09    5.00000
3     14/10    5.00000
4     14/11   21.81818
5     14/12   20.00000
6     15/01   10.00000
7     15/02   12.50000
8     15/04    5.00000
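
The ddply call above groups only by shortdate, which is why the household level disappears. A minimal base R sketch of the same idea with household and type added to the grouping is shown here. The data are a hand-made toy set with fixed dates (an assumption, so the result is reproducible), and sum is used instead of mean because monthly totals are what the question asks for. With plyr, the equivalent change would be to add household and type to the grouping variables.

```r
# Toy data with fixed dates (illustrative only, not the asker's real data)
df <- data.frame(
  household = rep(c("household1", "household2"), each = 3),
  date      = as.Date(c("1999-05-10", "1999-05-25", "1999-10-12",
                        "1999-02-02", "1999-02-19", "1999-08-20")),
  value     = c(100, 200, 300, 400, 500, 600),
  type      = c("income", "income", "energy", "water", "water", "energy")
)

# Short year/month label, e.g. "99/05"
df$shortdate <- format(df$date, "%y/%m")

# Grouping by household and type as well as shortdate keeps one row
# per household/month/type instead of collapsing across households
monthly <- aggregate(value ~ household + shortdate + type, data = df, FUN = sum)
monthly <- monthly[order(monthly$household, monthly$shortdate), ]
print(monthly)
```

The result is still in long format (one row per household, month and type); spreading the types into columns is a separate reshaping step.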

Any help would be greatly appreciated!

+0

Yes, I was just lazy and did not print the full df example. –

+0

Yes, ideally I would have 12 rows per household (unless you can recommend a better way). This matches another df I have from a different source. –

Answers

3

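It sounds like what you are looking for is a pivot table. I like to use reshape::cast for this kind of table. If a given expense type has more than one value for a given household/year/month combination, the values are summed. If there is only one value, that value is returned. The sum argument is not strictly required; it is only there to handle such exceptions. I think that if your data are clean you should not need it. (Note that without an aggregation function, cast falls back to length, i.e. a count, when a combination has more than one value.)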

hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3") 
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9)) 
value <- c(1:9) 
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy") 
df <- data.frame(hh, date, value, type) 

# Load lubridate and add month and year columns
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)

# Load reshape and run cast to build the pivot table, summing values per cell
library(reshape)
dfNew <- cast(df, hh + year + month ~ type, value = "value", fun.aggregate = sum)

> dfNew 
   hh year month energy income water
1 hh1 1999     4      3      0     0
2 hh1 1999    10      0      1     0
3 hh1 1999    11      0      0     2
4 hh2 1999     2      0      4     0
5 hh2 1999     3      6      0     0
6 hh2 1999     6      0      0     5
7 hh3 1999     1      9      0     0
8 hh3 1999     4      0      7     0
9 hh3 1999     8      0      0     8
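
For readers who do not have the reshape package installed, roughly the same pivot can be sketched in base R with aggregate() followed by reshape(). This is an illustration under assumptions, not the answer's method: the toy data use fixed dates and deliberately contain two income entries for hh1 in the same month, so the effect of summing is visible.

```r
# Toy data: hh1 has two income entries in April 1999 (illustrative only)
df <- data.frame(
  hh    = c("hh1", "hh1", "hh1", "hh2"),
  date  = as.Date(c("1999-04-03", "1999-04-20", "1999-10-01", "1999-02-11")),
  value = c(1, 2, 3, 4),
  type  = c("income", "income", "water", "energy")
)
df$year  <- as.integer(format(df$date, "%Y"))
df$month <- as.integer(format(df$date, "%m"))

# Step 1: sum values within each household/year/month/type cell
long <- aggregate(value ~ hh + year + month + type, data = df, FUN = sum)

# Step 2: spread the types into columns (value.energy, value.income, ...)
wide <- reshape(long, idvar = c("hh", "year", "month"),
                timevar = "type", direction = "wide")

# Cells with no transactions come back as NA; replace them with 0
value_cols <- grep("^value\\.", names(wide))
wide[value_cols][is.na(wide[value_cols])] <- 0
print(wide)
```

The two April income entries for hh1 end up as a single cell holding their sum, which is the behavior the sum argument gives in the cast call.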
+1

If I am right about the pivot-table nature of your question, you may want to mention that in the question or tag it accordingly. – JMT2080AD

+0

Yes, this is indeed a pivot table! Thanks for pointing that out. It works perfectly, and I have edited the tags. –

2

Try this:

df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m") 
library(dplyr) 
df %>% group_by(ym,type) %>% 
    summarise(mean_value=mean(value)) 

Source: local data frame [9 x 3] 
Groups: ym [?] 

             ym   type mean_value
  <S3: yearmon> <fctr>      <dbl>
1      jan 1999 income          1
2      jun 1999 energy          3
3      jul 1999 energy          6
4      jul 1999  water          2
5      ago 1999 income          4
6      set 1999 energy          9
7      set 1999 income          7
8      nov 1999  water          5
9      dez 1999  water          8

Edit: wide format:

# dfr is assumed to hold the summarised result of the pipeline above
reshape2::dcast(dfr, ym ~ type, value.var = "mean_value")

        ym energy income water
1 jan 1999     NA      1    NA
2 jun 1999      3     NA    NA
3 jul 1999      6     NA     2
4 ago 1999     NA      4    NA
5 set 1999      9      7    NA
6 nov 1999     NA     NA     5
7 dez 1999     NA     NA     8
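
The tables above are grouped by month and type only and report means, whereas the question asks for monthly sums per household. A hedged variant that keeps the household level might look like the sketch below. It departs from the answer in three ways, all of them assumptions made for illustration: format() stands in for zoo::as.yearmon, tidyr::pivot_wider stands in for dcast, and sum replaces mean. The data are a small toy set with fixed dates.

```r
library(dplyr)
library(tidyr)

# Toy data with fixed dates (illustrative only)
df <- data.frame(
  household = c("household1", "household1", "household1", "household2"),
  date      = as.Date(c("1999-05-10", "1999-05-25", "1999-10-12", "1999-02-02")),
  value     = c(100, 200, 300, 400),
  type      = c("income", "water", "energy", "income")
)

wide <- df %>%
  mutate(ym = format(date, "%Y-%m")) %>%            # year-month label
  group_by(household, ym, type) %>%                 # keep the household level
  summarise(total = sum(value), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = total,
              values_fill = list(total = 0))        # 0 where nothing was recorded
print(wide)
```

Each row is one household in one month, with one column per transaction type.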
0

If I understand your requirement correctly (from the description in the question), this is what you are looking for:

library(dplyr) 
library(tidyr) 

df %>% mutate(date = lubridate::month(date)) %>% 
    complete(household, date = 1:12) %>% 
    spread(type, value) %>% group_by(household, date) %>% 
    mutate(Total = sum(energy, income, water, na.rm = TRUE)) %>% 
    select(household, Month = date, energy:water, Total) 

#Source: local data frame [36 x 6] 
#Groups: household, Month [36] 
# 
# household Month energy income water Total 
#  <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> 
#1 household1  1  NA  NA NA  0 
#2 household1  2  NA  NA NA  0 
#3 household1  3  NA  NA 200 200 
#4 household1  4  NA  NA NA  0 
#5 household1  5  NA  NA NA  0 
#6 household1  6  NA  NA NA  0 
#7 household1  7  NA  NA NA  0 
#8 household1  8  NA  NA NA  0 
#9 household1  9 300  NA NA 300 
#10 household1 10  NA  NA NA  0 
# ... with 26 more rows 

Note: I used the same df you provided in the question. The only change I made is to the value column, for which I used seq(100, 900, 100).

If I got it wrong, let me know and I will delete my answer. If it is correct, I will add an explanation.
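
The key step in the pipeline above is complete(), which adds a row for every household/month combination that has no transactions, so each household ends up with 12 rows. A minimal base R sketch of that step, assuming an already-pivoted toy table (the name wide and its values are hypothetical), could look like this:

```r
# Hypothetical wide table: only months with at least one transaction
wide <- data.frame(
  household = c("household1", "household1", "household2"),
  month     = c(3L, 9L, 2L),
  energy    = c(0, 300, 600),
  income    = c(0, 0, 400),
  water     = c(200, 0, 0)
)

# Every household crossed with every month of the year
grid <- expand.grid(household = unique(wide$household), month = 1:12,
                    stringsAsFactors = FALSE)

# Left join: months without data appear with NA in the value columns
full <- merge(grid, wide, by = c("household", "month"), all.x = TRUE)

# Treat missing months as zero and add a row total
cats <- c("energy", "income", "water")
full[cats][is.na(full[cats])] <- 0
full$Total <- rowSums(full[cats])
full <- full[order(full$household, full$month), ]
print(head(full, 12))
```

Filling the gaps with 0 rather than leaving NA is a design choice; NA may be preferable if "no record" should stay distinguishable from "zero spent".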