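Filling in values without a loop

2015-09-07

I have a large data frame x with stock prices on specific dates. I want to merge this data set with a date variable and then carry the last known value of x forward until the next valid date, so that I end up with data frame z. The example below shows the case for a single stock.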

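I am currently using a loop, but it is very slow because I have five to ten years of daily data for thousands of stocks.

Is there another way? In Matlab, the same code runs much faster.

It is also important that I can use a condition other than the simple is.na(z[t,2]) test.

Here is the example: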

> x=data.frame(c("2015-05-31","2015-06-30","2015-07-31"),c(100,200,150)) 
> colnames(x)=c("Date","AAPL") 
> x[,1]=as.Date(x[,1],origin="1970-01-01") 
> 
> x 
     Date AAPL 
1 2015-05-31 100 
2 2015-06-30 200 
3 2015-07-31 150 
> 
> date=data.frame(c("2015-05-31","2015-06-01","2015-06-02","2015-06-03","2015-06-04","2015-06-05","2015-06-06","2015-06-07","2015-06-08","2015-06-09","2015-06-10","2015-06-11","2015-06-12","2015-06-13","2015-06-14","2015-06-15","2015-06-16","2015-06-17","2015-06-18","2015-06-19","2015-06-20","2015-06-21","2015-06-22","2015-06-23","2015-06-24","2015-06-25","2015-06-26","2015-06-27","2015-06-28","2015-06-29","2015-06-30","2015-07-01","2015-07-02","2015-07-03","2015-07-04","2015-07-05","2015-07-06","2015-07-07","2015-07-08","2015-07-09","2015-07-10","2015-07-11","2015-07-12","2015-07-13","2015-07-14","2015-07-15","2015-07-16","2015-07-17","2015-07-18","2015-07-19","2015-07-20","2015-07-21","2015-07-22","2015-07-23","2015-07-24","2015-07-25","2015-07-26","2015-07-27","2015-07-28","2015-07-29","2015-07-30","2015-07-31")) 
> colnames(date)=c("Date") 
> date[,1]=as.Date(date[,1],origin="1970-01-01") 
> 
> date 
     Date 
1 2015-05-31 
2 2015-06-01 
3 2015-06-02 
29 ... 
30 2015-06-29 
31 2015-06-30 
32 2015-07-01 
33 2015-07-02 

> 
> z=merge(x=x, y=date, by.x="Date", by.y="Date",all.y=TRUE) 
> 
> 
> #Converting z to a data matrix speeds up the loop 
> z=data.matrix(z) 
> 
> for (t in 2:nrow(z)) { 
+ if (is.na(z[t,2])) { 
+  z[t,2]=z[t-1,2]  #carry the previous value forward 
+ } 
+ } 
> 
> z=as.data.frame(z) 
> z[,1]=as.Date(z[,1],origin="1970-01-01") 
> 
> z 
     Date AAPL 
1 2015-05-31 100 
2 2015-06-01 100 
3 2015-06-02 100 
29 ... 
30 2015-06-29 100 
31 2015-06-30 200 
32 2015-07-01 200 
33 2015-07-02 200 

Answers

2

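We can do this with base R. We get a logical index of the non-NA 'AAPL' elements ('i1'), take the cumsum of 'i1' to turn it into a numeric index, and use that index to replace each NA element with the preceding non-NA element.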

i1 <- !is.na(z$AAPL) 
z$AAPL <- z$AAPL[i1][cumsum(i1)] 
head(z) 
#  Date AAPL 
#1 2015-05-31 100 
#2 2015-06-01 100 
#3 2015-06-02 100 
#4 2015-06-03 100 
#5 2015-06-04 100 
#6 2015-06-05 100 
tail(z) 
#   Date AAPL 
#57 2015-07-26 200 
#58 2015-07-27 200 
#59 2015-07-28 200 
#60 2015-07-29 200 
#61 2015-07-30 200 
#62 2015-07-31 150 
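
One caveat with the indexing trick above: it relies on the first element not being NA. If a series starts with NA, cumsum(i1) starts at 0 and the result comes out shorter than the input. Below is a minimal sketch of a wrapper that handles this case and applies the fill to several price columns at once. The name fill_forward, the data frame z2 and the MSFT column are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: carry the last non-NA value forward.
# Leading NAs stay NA because there is nothing to carry yet.
fill_forward <- function(v) {
  ok  <- !is.na(v)
  idx <- cumsum(ok)        # 0 before the first valid value
  idx[idx == 0] <- NA      # an NA index yields an NA result
  v[ok][idx]
}

# Toy data; the MSFT column is invented for this example
z2 <- data.frame(
  Date = as.Date("2015-05-31") + 0:5,
  AAPL = c(100, NA, NA, 200, NA, NA),
  MSFT = c(NA, 40, NA, NA, 45, NA)
)

# Apply the fill to every column except Date
price_cols <- setdiff(names(z2), "Date")
z2[price_cols] <- lapply(z2[price_cols], fill_forward)
z2
```

Because each column is processed with vectorised indexing rather than a row-by-row loop, this should scale reasonably to many stocks, although that has not been benchmarked here.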

I prefer this solution because it is fast and uses base R. The other solutions proposed here also work. Many thanks for your post! – fuji2015

0

If you decide to use a time series package such as zoo, this can easily be done with na.locf from zoo. Here is some info
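
As a minimal sketch of what that looks like on a plain numeric vector (the vector prices is invented for illustration, and the zoo package is assumed to be installed):

```r
library(zoo)

# Toy series with a leading NA and gaps
prices <- c(NA, 100, NA, NA, 200, NA)

# na.rm = FALSE keeps leading NAs, so the result has the same
# length as the input and can be assigned back to a column
filled <- na.locf(prices, na.rm = FALSE)
filled
```

With the default na.rm = TRUE, leading NAs are dropped instead, which changes the length of the result.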

3

Using the dplyr and zoo packages works for me:

library(dplyr) 
library(zoo) 

# left_join(date, x) keeps one row per calendar date, in date order 
my_new_df <- 
    left_join(date, x, by = "Date") %>% 
    mutate(y = na.locf(AAPL, na.rm = FALSE)) 

head(my_new_df) 

     Date AAPL y 
1 2015-05-31 100 100 
2 2015-06-01 NA 100 
3 2015-06-02 NA 100 
4 2015-06-03 NA 100 
5 2015-06-04 NA 100 
6 2015-06-05 NA 100 

tail(my_new_df) 

     Date AAPL y 
57 2015-07-26 NA 200 
58 2015-07-27 NA 200 
59 2015-07-28 NA 200 
60 2015-07-29 NA 200 
61 2015-07-30 NA 200 
62 2015-07-31 150 150 

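Thanks for your answer. Might there also be a base R approach? As I wrote in my question, I would like to be able to use not only the is.na(z[t,2]) condition but also refer to another date variable (for example: fill only if the date is a Wednesday). – fuji2015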
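
One possible base R reading of this request, sketched under assumptions: the loop rule "if cond[t] is TRUE, copy the value from row t-1" can be vectorised for any logical vector cond by treating the rows where cond is FALSE as the ones that keep their own value. The helper name fill_where, the toy data and the Wednesday rule are illustrative guesses at what the comment means, not something confirmed in the thread.

```r
# Hypothetical generalisation of the loop
#   if (cond[t]) v[t] <- v[t - 1]
# to an arbitrary logical vector cond.
fill_where <- function(v, cond) {
  keep <- !cond              # rows that keep their own value
  idx  <- cumsum(keep)       # index of the most recent kept row
  idx[idx == 0] <- NA        # nothing to copy before the first kept row
  v[keep][idx]
}

dates <- as.Date("2015-06-01") + 0:6   # Monday 1 June to Sunday 7 June 2015
v     <- c(100, 110, NA, NA, 200, NA, NA)

# Assumed example condition: fill only missing values that fall on a Wednesday.
# as.POSIXlt()$wday counts days from Sunday = 0, so Wednesday is 3.
is_wed <- as.POSIXlt(dates)$wday == 3
out    <- fill_where(v, is.na(v) & is_wed)
out
```

Passing is.na(v) alone as the condition reproduces the plain fill-forward behaviour, so the same helper covers both cases.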

2

You could try a concise (and fast) data.table solution:

library(data.table) 
# rolling join: for each row of date, take the last row of x at or before that Date 
setkey(setDT(x),Date)[setDT(date), roll=TRUE]