2015-07-03 63 views
3

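R: best way to create a time series from group start and end dates

I have a dataset in which each group has a start date and an end date. I want to convert it so that each group has one observation per time period (month).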

Below is a sample of the input data; groups are identified by id:

structure(list(id = c(723654, 885618, 269861, 1383642, 250276, 
815511, 1506680, 1567855, 667345, 795731), startdate = c("2008-06-29", 
"2008-12-01", "2006-09-27", "2010-02-03", "2006-08-31", "2008-09-10", 
"2010-04-11", "2010-05-15", "2008-04-12", "2008-08-28"), enddate = c("2008-08-13", 
"2009-02-08", "2007-10-12", "2010-09-09", "2007-06-30", "2010-04-27", 
"2010-04-13", "2010-05-16", "2010-04-20", "2010-03-09")), .Names = c("id", 
"startdate", "enddate"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "6", "7", "8", "9", "10", "11")) 

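I wrote a function and vectorized it. The function takes the three arguments stored in each row and generates a time series together with the group identifier.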

genDateRange<-function(start, end, id){ 
    dates<-seq(as.Date(start), as.Date(end), by="month") 
    return(cbind(month=as.character(dates), id=rep(id, length(dates)))) 
} 

genDataRange<-Vectorize(genDateRange) 
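
As a quick illustration of what the function returns for a single group, here is a minimal sketch that calls the non-vectorized version on the values from the second row of the sample data. Note that cbind() produces a character matrix, so id is coerced to a string.

```r
# Same function as above, repeated so the snippet runs on its own
genDateRange <- function(start, end, id) {
  dates <- seq(as.Date(start), as.Date(end), by = "month")
  cbind(month = as.character(dates), id = rep(id, length(dates)))
}

# One group: values copied from the second row of the sample data
one <- genDateRange("2008-12-01", "2009-02-08", 885618)
print(one)  # 3 x 2 character matrix with columns "month" and "id"
```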

I run the function as follows to get a data frame. The output has more than six million rows, so it takes forever. I need a faster way.

range<-do.call(rbind,genDataRange(dat$startdate, dat$enddate, dat$id)) 

The first ten rows of the output look like this:

structure(c("2008-06-29", "2008-07-29", "2008-12-01", "2009-01-01", 
"2009-02-01", "2006-09-27", "2006-10-27", "2006-11-27", "2006-12-27", 
"2007-01-27", "723654", "723654", "885618", "885618", "885618", 
"269861", "269861", "269861", "269861", "269861"), .Dim = c(10L, 
2L), .Dimnames = list(NULL, c("month", "id"))) 

I am hoping for a faster way to do this. I think I am too focused on one approach and am missing a simpler solution.

+0

How long is "forever"? How much data do you have? – rawr

+0

Your first 10 rows of output don't look right; there should be 2 columns – C8H10N4O2

+0

The .Dim = c(10L, 2L) bit creates the two columns. – PoorLifeChoicesMadeMeWhoIAm

Answers

2

There is no need for a generator function or rbindlist, because data.table can handle this easily without them.

# start with a data.table and date columns 
library(data.table) 
dat <- data.table(dat) 
dat[,`:=`(startdate = as.Date(startdate), enddate = as.Date(enddate))] 
dat[,num_mons:= length(seq(from=startdate, to=enddate, by='month')),by=1:nrow(dat)] 

dat # now your data.table looks like this 
#   id startdate enddate num_mons 
# 1: 723654 2008-06-29 2008-08-13  2 
# 2: 885618 2008-12-01 2009-02-08  3 
# 3: 269861 2006-09-27 2007-10-12  13 
# 4: 1383642 2010-02-03 2010-09-09  8 
# 5: 250276 2006-08-31 2007-06-30  10 
# 6: 815511 2008-09-10 2010-04-27  20 
# 7: 1506680 2010-04-11 2010-04-13  1 
# 8: 1567855 2010-05-15 2010-05-16  1 
# 9: 667345 2008-04-12 2010-04-20  25 
# 10: 795731 2008-08-28 2010-03-09  19 

out <- dat[, list(month=seq.Date(startdate, by="month",length.out=num_mons)), by=id] 
out 
#   id  month 
# 1: 723654 2008-06-29 
# 2: 723654 2008-07-29 
# 3: 885618 2008-12-01 
# 4: 885618 2009-01-01 
# 5: 885618 2009-02-01 
# ---     
# 98: 795731 2009-10-28 
# 99: 795731 2009-11-28 
# 100: 795731 2009-12-28 
# 101: 795731 2010-01-28 
# 102: 795731 2010-02-28 

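There is a related question, but the difference is that in your case we iterate over the rows, and there are no duplicate rows in the data table.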
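
If the per-row call to seq() used to compute num_mons ever becomes the bottleneck, the month count and the expansion can also be done with vectorized arithmetic. The following is a minimal base-R sketch and not part of the answer above: count_months and expand_months are hypothetical helper names, and it assumes startdate <= enddate and a start day-of-month of at most 28 (for later days, seq() by month can roll over into the next month and the counts may differ).

```r
# Four rows from the sample data whose start day-of-month is <= 28
dat <- data.frame(
  id        = c(885618, 269861, 1383642, 815511),
  startdate = as.Date(c("2008-12-01", "2006-09-27", "2010-02-03", "2008-09-10")),
  enddate   = as.Date(c("2009-02-08", "2007-10-12", "2010-09-09", "2010-04-27"))
)

# Hypothetical helper: number of monthly steps from start that are <= end.
# Assumes start <= end and a start day-of-month of at most 28.
count_months <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  whole <- 12L * (e$year - s$year) + (e$mon - s$mon)
  # Drop the last step when the end day falls before the start day
  whole + 1L - (e$mday < s$mday)
}

# Hypothetical helper: repeat each start date n times, then shift the
# month field by 0, 1, ..., n - 1 within each group.
expand_months <- function(id, start, end) {
  n  <- count_months(start, end)
  lt <- as.POSIXlt(rep(start, times = n))
  lt$mon <- lt$mon + sequence(n) - 1L  # as.Date() normalizes months past December
  data.frame(id = rep(id, times = n), month = as.Date(lt))
}

out <- expand_months(dat$id, dat$startdate, dat$enddate)
head(out, 5)
```

The design choice here is to avoid one function call per group; whether it is actually faster than the data.table version on six million rows would need to be measured.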

+0

This solution is really fast. Thank you. – PoorLifeChoicesMadeMeWhoIAm

1

For large datasets, this

library(data.table) 
range <- rbindlist(lapply(genDataRange(dat$startdate, dat$enddate, dat$id),as.data.frame)) 

should be faster than

range<-do.call(rbind,genDataRange(dat$startdate, dat$enddate, dat$id)) 
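
To check that the two forms produce the same rows, here is a small self-contained sketch on two rows of the sample data. It assumes the data.table package is installed and reuses the function definitions from the question. The result is named slow and fast here only for illustration.

```r
library(data.table)

# Function definitions copied from the question
genDateRange <- function(start, end, id) {
  dates <- seq(as.Date(start), as.Date(end), by = "month")
  cbind(month = as.character(dates), id = rep(id, length(dates)))
}
genDataRange <- Vectorize(genDateRange)

# Two rows taken from the sample data
dat <- data.frame(
  id        = c(885618, 1383642),
  startdate = c("2008-12-01", "2010-02-03"),
  enddate   = c("2009-02-08", "2010-09-09"),
  stringsAsFactors = FALSE
)

# One character matrix per group (a list, because the groups differ in length)
pieces <- genDataRange(dat$startdate, dat$enddate, dat$id)

slow <- do.call(rbind, pieces)                     # character matrix
fast <- rbindlist(lapply(pieces, as.data.frame))   # data.table

nrow(slow) == nrow(fast)
```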