2016-12-16

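Summing event data in R

I am working with daily rainfall data spanning several years. I want to sum the rainfall over consecutive rainy days to get the total for each rain event. It would also be nice to get the start and end date and the rainfall intensity of each event. I figure I could hack something together with aggregate, but what I have in mind seems bulky. Is there a quick and elegant solution, perhaps with dplyr, tidyr, or data.table?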

Data

structure(list(Time = structure(c(1353398400, 1353484800, 1353571200, 
1353657600, 1353744000, 1353830400, 1353916800, 1354003200, 1354089600, 
1354176000, 1354262400, 1354348800, 1354435200, 1354521600, 1354608000, 
1354694400, 1354780800, 1354867200, 1354953600, 1355040000, 1355126400, 
1355212800, 1355299200, 1355385600, 1355472000, 1355558400, 1355644800, 
1355731200, 1355817600, 1355904000, 1355990400, 1356076800, 1356163200, 
1356249600, 1356336000, 1356422400, 1356508800, 1356595200, 1356681600, 
1356768000, 1356854400, 1356940800, 1357027200, 1357113600, 1357200000, 
1357286400, 1357372800, 1357459200, 1357545600, 1357632000, 1357718400 
), class = c("POSIXct", "POSIXt"), tzone = ""), inc = c(NA, NA, 
NA, NA, NA, NA, NA, 0.11, NA, 0.62, 0.0899999999999999, 0.39, 
NA, NA, 0.03, NA, NA, NA, NA, NA, NA, 0.34, NA, NA, NA, NA, 0.0600000000000001, 
0.02, NA, NA, NA, 0.29, 0.35, 0.02, 0.27, 0.17, 0.0600000000000001, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.47, NA, NA, NA, 0.0300000000000002 
)), .Names = c("Time", "inc"), row.names = 50:100, class = "data.frame") 

Desired output

Begin End Days Total Intensity 
11/27/2012 11/27/2012 1 0.11 0.11 
11/29/2012 12/1/2012 3 1.1 0.366666667 
12/4/2012 12/4/2012 1 0.03 0.03 
12/11/2012 12/11/2012 1 0.34 0.34 
12/16/2012 12/17/2012 2 0.08 0.04 
12/21/2012 12/26/2012 6 0.29 0.048333333 
1/5/2013 1/5/2013 1 0.47 0.47 
1/9/2013 1/9/2013 1 0.03 0.03 

Answers

3

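data.table::rleid is a convenient function for handling runs of consecutive values. Assuming your data frame is named df and has already been sorted by the Time variable: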

library(data.table) 
setDT(df) 
na.omit(df[,.(Begin = as.Date(first(Time)), 
       End = as.Date(last(Time)), 
       Days = as.Date(last(Time)) - as.Date(first(Time)) + 1, 
       Total = sum(inc), Intensity = mean(inc)), 
      by = .(id = rleid(is.na(inc)))]) 

# id  Begin  End Days Total Intensity 
#1: 2 2012-11-27 2012-11-27 1 days 0.11 0.1100000 
#2: 4 2012-11-29 2012-12-01 3 days 1.10 0.3666667 
#3: 6 2012-12-04 2012-12-04 1 days 0.03 0.0300000 
#4: 8 2012-12-11 2012-12-11 1 days 0.34 0.3400000 
#5: 10 2012-12-16 2012-12-17 2 days 0.08 0.0400000 
#6: 12 2012-12-21 2012-12-26 6 days 1.16 0.1933333 #I think you have some miscalculation here 
#7: 14 2013-01-05 2013-01-05 1 days 0.47 0.4700000 
#8: 16 2013-01-09 2013-01-09 1 days 0.03 0.0300000 
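
To see why grouping by rleid(is.na(inc)) isolates each rain event, here is a small base-R sketch that builds the same kind of run id with rle(), so it runs without data.table. The vector inc_demo and the helper run_id are made up for illustration and are not part of the answer above.

```r
# Hypothetical toy series: NA marks a dry day, numbers are rainfall amounts.
inc_demo <- c(NA, 0.11, NA, 0.62, 0.09, 0.39, NA, NA, 0.03)

# Base-R stand-in for data.table::rleid(): number each run of identical values.
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

# Dry and wet runs alternate, so each wet spell gets its own id.
id <- run_id(is.na(inc_demo))
id

# Keep only the wet days and total each run.
wet <- !is.na(inc_demo)
totals <- tapply(inc_demo[wet], id[wet], sum)
totals
```

In the data.table answer the dry runs are not filtered beforehand; their sum is NA, which is why the result is wrapped in na.omit().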
+0

'Error in as.Date(first(Time)) : could not find function "first"'. Is there another package that contains the function 'first'? – CCurtis

+0

No. It comes with the 'data.table' package. Which version of 'data.table' are you using? – Psidom

+0

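That was it. I was using 1.96. I have updated and it runs now. Thanks. I also like the 'dplyr' solution, but this one is actually more concise. 'rleid' certainly cuts out a lot of code. – CCurtis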

1

Here is an approach using dplyr.

First, some preliminary cleanup: we need a Date variable rather than a POSIXct:

library(dplyr) 

df2 <- df %>% 
    mutate(date = as.Date(Time)) %>% 
    select(-Time) 

This computes a data frame with an explicit rain_event variable:

df3 <- df2 %>% 
    filter(!is.na(inc)) %>% 
    mutate(
    day_lag = as.numeric(difftime(date, lag(date), units = "days")), 
    # special case: first rain event 
    day_lag = ifelse(is.na(day_lag), 1, day_lag), 
    rain_event = 1 + cumsum(day_lag > 1) 
) 

> df3 
    inc  date day_lag rain_event 
1 0.11 2012-11-27  1   1 
2 0.62 2012-11-29  2   2 
3 0.09 2012-11-30  1   2 
4 0.39 2012-12-01  1   2 
5 0.03 2012-12-04  3   3 
6 0.34 2012-12-11  7   4 
7 0.06 2012-12-16  5   5 
8 0.02 2012-12-17  1   5 
9 0.29 2012-12-21  4   6 
10 0.35 2012-12-22  1   6 
11 0.02 2012-12-23  1   6 
12 0.27 2012-12-24  1   6 
13 0.17 2012-12-25  1   6 
14 0.06 2012-12-26  1   6 
15 0.47 2013-01-05  10   7 
16 0.03 2013-01-09  4   8 
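
The core of the step above is the cumsum() over "gap longer than one day" flags. A minimal base-R sketch of just that idea, on a made-up vector wet_dates (the first five rainy dates from the table), might look like this:

```r
# Hypothetical rainy dates, with dry days already removed.
wet_dates <- as.Date(c("2012-11-27", "2012-11-29", "2012-11-30",
                       "2012-12-01", "2012-12-04"))

# Gap in days to the previous rainy day. The first day has no predecessor,
# so it is given a gap of 1, which keeps it in event 1.
gap <- c(1, as.numeric(diff(wet_dates)))

# Every gap longer than one day opens a new event.
rain_event <- 1 + cumsum(gap > 1)
rain_event
```

This relies on the rows being sorted by date; unsorted input would produce negative gaps and wrong event numbers.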

Now summarise each rain event to compute the metrics you care about:

df3 %>% 
    group_by(rain_event) %>% 
    summarise(
    begin = min(date), 
    end = max(date), 
    days = n(), 
    total = sum(inc), 
    intensity = mean(inc) 
) 

    # A tibble: 8 × 6 
    rain_event  begin  end days total intensity 
     <dbl>  <date>  <date> <int> <dbl>  <dbl> 
1   1 2012-11-27 2012-11-27  1 0.11 0.1100000 
2   2 2012-11-29 2012-12-01  3 1.10 0.3666667 
3   3 2012-12-04 2012-12-04  1 0.03 0.0300000 
4   4 2012-12-11 2012-12-11  1 0.34 0.3400000 
5   5 2012-12-16 2012-12-17  2 0.08 0.0400000 
6   6 2012-12-21 2012-12-26  6 1.16 0.1933333 
7   7 2013-01-05 2013-01-05  1 0.47 0.4700000 
8   8 2013-01-09 2013-01-09  1 0.03 0.0300000 
1

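Base packages only, essentially using the aggregate function. I know it is not the best option. The only problem is the date format (each date column of the data frame has to be converted to the desired date format one by one, otherwise it ends up as integers):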

data1 <- structure(list(Time = structure(c(1353398400, 1353484800, 1353571200, 
    1353657600, 1353744000, 1353830400, 1353916800, 1354003200, 1354089600, 
    1354176000, 1354262400, 1354348800, 1354435200, 1354521600, 1354608000, 
    1354694400, 1354780800, 1354867200, 1354953600, 1355040000, 1355126400, 
    1355212800, 1355299200, 1355385600, 1355472000, 1355558400, 1355644800, 
    1355731200, 1355817600, 1355904000, 1355990400, 1356076800, 1356163200, 
    1356249600, 1356336000, 1356422400, 1356508800, 1356595200, 1356681600, 
    1356768000, 1356854400, 1356940800, 1357027200, 1357113600, 1357200000, 
    1357286400, 1357372800, 1357459200, 1357545600, 1357632000, 1357718400 
    ), class = c("POSIXct", "POSIXt"), tzone = ""), inc = c(NA, NA, 
    NA, NA, NA, NA, NA, 0.11, NA, 0.62, 0.0899999999999999, 0.39, 
    NA, NA, 0.03, NA, NA, NA, NA, NA, NA, 0.34, NA, NA, NA, NA, 0.0600000000000001, 
    0.02, NA, NA, NA, 0.29, 0.35, 0.02, 0.27, 0.17, 0.0600000000000001, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.47, NA, NA, NA, 0.0300000000000002 
    )), .Names = c("Time", "inc"), row.names = 50:100, class = "data.frame") 

rainruns <- function(datas = data1) { 
    incs <- c(NA, datas$inc) # inc values with a leading NA, so each day can be compared with the previous one
    event <- cumsum(is.na(incs[-length(incs)]) & !is.na(incs[-1])) # counter for rain events 
    datas <- cbind(datas, event) # add events column 
    datas2 <- datas[!is.na(datas$inc),] # delete na's 
    summarydata1 <- aggregate(datas2$inc, by = list(datas2$event), # summarize rain data by event 
           FUN = function(x) c(length(x), sum(x), mean(x)))[[2]] 
    summarydata2 <- aggregate(as.Date(datas2$Time), by = list(datas2$event), # summarize dates by event 
           FUN = function(x) c(min(x), max(x)))[[2]] 
    summarydata <- data.frame(format(as.Date(summarydata2[,1], # combine both, correcting date formats 
              origin = "1970-01-01"), "%m/%d/%Y"), 
           format(as.Date(summarydata2[,2], 
              origin = "1970-01-01"), "%m/%d/%Y"), summarydata1) 
    names(summarydata) <- c("Begin", "End", "Days", "Total", "Intensity") # update column names 
    return(summarydata) 
} 
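
The event counter inside rainruns compares each day with the one before it: an event starts wherever a dry day (NA) is followed by a wet day. A small sketch of that step alone, using an invented vector inc_demo, could be:

```r
# Hypothetical toy series: NA marks a dry day.
inc_demo <- c(NA, 0.11, NA, 0.62, 0.09, NA, NA, 0.03)

# Previous day's value; the first day is treated as following a dry day.
prev <- c(NA, inc_demo[-length(inc_demo)])

# An event starts where the previous day was dry and the current day is wet.
starts <- is.na(prev) & !is.na(inc_demo)

# Running count of event starts gives the event number for every row.
event <- cumsum(starts)
event
```

Dry days simply carry the number of the preceding event, which is why the function drops the NA rows before calling aggregate.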
+0

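Thanks. Yes, that is why I wanted to look at options other than 'aggregate'. It is a great function, but it gets unwieldy when you try to do too much with it. – CCurtis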

1

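You can add a new column that groups rows together when they belong to the same run of consecutive rainy days, then use dplyr to get the desired statistics. Assuming your data frame is called df: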

library(dplyr) 
rain_period = rep(NA,nrow(df)) #initialize vector 
group=1 #initialize group number 
for(i in 1:nrow(df)){ 
    if(is.na(df$inc[i])) group = group + 1 
    else rain_period[i] = group 
} 
df$group = rain_period 


result = dplyr::filter(df, !is.na(group)) # drop dry days, which have no group
result = dplyr::group_by(result, group)
result = dplyr::summarise(result, 
         Begin = min(Time), 
         End = max(Time), 
         Days = n(), 
         Total = sum(inc), 
         Intensity = mean(inc))
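
As a side note, the grouping loop can also be written without a loop. The sketch below, on a made-up vector inc_demo, runs the loop and a cumsum() version side by side so the two can be compared; loop_group and vec_group are illustrative names only.

```r
# Hypothetical toy series: NA marks a dry day.
inc_demo <- c(NA, 0.11, NA, 0.62, 0.09, NA, NA, 0.03)

# Loop version, following the same logic as the answer.
loop_group <- rep(NA, length(inc_demo))
group <- 1
for (i in seq_along(inc_demo)) {
  if (is.na(inc_demo[i])) {
    group <- group + 1        # a dry day closes the current spell
  } else {
    loop_group[i] <- group    # a wet day joins the current spell
  }
}

# Vectorised version: count the dry days seen so far, then blank out dry rows.
vec_group <- 1 + cumsum(is.na(inc_demo))
vec_group[is.na(inc_demo)] <- NA

loop_group
vec_group
```

The group numbers are not consecutive (several dry days in a row skip numbers), but they only need to be distinct for each rainy spell.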