2017-04-17
1

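How to build an efficient loop lookup in R

I have a dataset containing the dates on which a person left a network. A person can leave the network more than once, because they may rejoin it after leaving. The following code reproduces the scenario.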

library(data.table) 
Leaving_Date<- data.table(Id= c(1,2,3,4,3,5),Date = as.Date(
c("2017-01-01","2017-02-03","2017-01-01","2017-03-10","2017-02-09","2017-02-05"))) 

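(Ids can appear several times in this table, because a person can leave the network more than once, provided they rejoined it in between.)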

> Leaving_Date
   Id       Date
1:  1 2017-01-01
2:  2 2017-02-03
3:  3 2017-01-01
4:  4 2017-03-10
5:  3 2017-02-09
6:  5 2017-02-05

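I have another dataset giving the dates on which a particular person was followed up. A follow-up can take place either before or after they left the network. The following code reproduces the scenario.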

FOLLOWUPs <- data.table(Id = c(1,2,3,2,2,3,3,4,1,5), 
         Date =as.Date(c("2016-10-01","2017-02-04", 
         "2017-01-17","2017-02-23", "2017-03-03", 
         "2017-02-10","2017-02-11","2017-01-01", 
         "2017-01-15","2017-01-01"))) 


> FOLLOWUPs
    Id       Date
 1:  1 2016-10-01
 2:  2 2017-02-04
 3:  3 2017-01-17
 4:  2 2017-02-23
 5:  2 2017-03-03
 6:  3 2017-02-10
 7:  3 2017-02-11
 8:  4 2017-01-01
 9:  1 2017-01-15
10:  5 2017-01-01

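Now, for each case in Leaving_Date, I want to look up the dates on which that person was followed up and create three columns (SevenDay, FourteenDay, ThirtyDay) that indicate, with 1s and 0s, whether a follow-up (if any) occurred within that period after leaving. I am using the following code: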

SEVENDAY_FOLLOWUP <- vector() 
FOURTEEN_FOLLOWUP <- vector() 
THIRTYDAY_FOLLOWUP <- vector() 
for(i in 1:nrow(Leaving_Date)){ 
    sub_data <- FOLLOWUPs[Id== Leaving_Date[i,1]] 
    if(nrow(sub_data[Date > Leaving_Date[i,Date] & 
        Date < (Leaving_Date[i,Date]+7)])== 0){ 
    SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,0) 
    } 
    else{ 
    SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,1) 
    } 

    if(nrow(sub_data[Date > Leaving_Date[i,Date] & 
        Date < (Leaving_Date[i,Date]+14)])== 0){ 
    FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,0) 
    } 
    else{ 
    FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,1) 
    } 

    if(nrow(sub_data[Date > Leaving_Date[i,Date] & 
        Date < (Leaving_Date[i,Date]+30)])== 0){ 
    THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,0) 
    } 
    else{ 
    THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,1) 
    } 
}    


Leaving_Date$SEVENDAY <- as.vector(SEVENDAY_FOLLOWUP) 
Leaving_Date$FOURTEENDAY <- as.vector(FOURTEEN_FOLLOWUP) 
Leaving_Date$THIRTYDAY <- as.vector(THIRTYDAY_FOLLOWUP) 

Final data

> Leaving_Date
   Id       Date SEVENDAY FOURTEENDAY THIRTYDAY
1:  1 2017-01-01        0           0         1
2:  2 2017-02-03        1           1         1
3:  3 2017-01-01        0           0         1
4:  4 2017-03-10        0           0         0
5:  3 2017-02-09        1           1         1
6:  5 2017-02-05        0           0         0

This code is very inefficient: I need to run it on 100K observations, and it takes a long time. Is there a more efficient way to do this?

+1

You may want to read the second circle of the [R Inferno](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) – shayaa

+1

@Frank I edited it –

Answers

4

Use a non-equi join:

setorder(FOLLOWUPs, Id, Date) 
Leaving_Date[, n := 
    FOLLOWUPs[.SD, on=.(Id, Date > Date), mult = "first", x.Date - i.Date] 
] 

   Id       Date       n
1:  1 2017-01-01 14 days
2:  2 2017-02-03  1 days
3:  3 2017-01-01 16 days
4:  4 2017-03-10 NA days
5:  3 2017-02-09  1 days
6:  5 2017-02-05 NA days

Switching from Date to IDate will probably make this about twice as fast. See ?IDate.
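As a rough sketch of the IDate suggestion (not the answerer's own code, and with no speed-up measured here), the Date columns can be converted with data.table's as.IDate(), which stores each date as an integer number of days. The data is the sample from the question, and computing the gap on plain integers is a choice made for this illustration.

```r
library(data.table)

# Sample data from the question.
Leaving_Date <- data.table(
  Id   = c(1, 2, 3, 4, 3, 5),
  Date = as.Date(c("2017-01-01", "2017-02-03", "2017-01-01",
                   "2017-03-10", "2017-02-09", "2017-02-05")))
FOLLOWUPs <- data.table(
  Id   = c(1, 2, 3, 2, 2, 3, 3, 4, 1, 5),
  Date = as.Date(c("2016-10-01", "2017-02-04", "2017-01-17", "2017-02-23",
                   "2017-03-03", "2017-02-10", "2017-02-11", "2017-01-01",
                   "2017-01-15", "2017-01-01")))

# Convert both Date columns to integer-backed IDate.
Leaving_Date[, Date := as.IDate(Date)]
FOLLOWUPs[, Date := as.IDate(Date)]

# Same non-equi join as in the answer; the gap to the first later
# follow-up is computed as an integer number of days (NA if none).
setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n := FOLLOWUPs[.SD, on = .(Id, Date > Date), mult = "first",
                              as.integer(x.Date) - as.integer(i.Date)]]
```

Whether this is actually faster on 100K rows would need to be benchmarked on the real data.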


I think it is best to stop here, but n can be compared against 7, 14, and 30 if necessary, like

Leaving_Date[, bin := c(7, 14, 30)[ findInterval(n, c(0, 7, 14, 30)) ]] 

   Id       Date       n bin
1:  1 2017-01-01 14 days  30
2:  2 2017-02-03  1 days   7
3:  3 2017-01-01 16 days  30
4:  4 2017-03-10 NA days  NA
5:  3 2017-02-09  1 days   7
6:  5 2017-02-05 NA days  NA
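If the three 0/1 columns from the question are needed rather than a single bin, one possible sketch is to compare n against each threshold directly. This is an illustration, not part of the original answer. It relies on the fact that if any follow-up falls inside a window after the leaving date, the first follow-up after that date does too. The column names SEVENDAY, FOURTEENDAY, and THIRTYDAY are taken from the question, and the strict `<` mirrors the question's loop.

```r
library(data.table)

# Sample data from the question.
Leaving_Date <- data.table(
  Id   = c(1, 2, 3, 4, 3, 5),
  Date = as.Date(c("2017-01-01", "2017-02-03", "2017-01-01",
                   "2017-03-10", "2017-02-09", "2017-02-05")))
FOLLOWUPs <- data.table(
  Id   = c(1, 2, 3, 2, 2, 3, 3, 4, 1, 5),
  Date = as.Date(c("2016-10-01", "2017-02-04", "2017-01-17", "2017-02-23",
                   "2017-03-03", "2017-02-10", "2017-02-11", "2017-01-01",
                   "2017-01-15", "2017-01-01")))

# Days from each leaving date to the first later follow-up (NA if none),
# using the non-equi join from the answer.
setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n := FOLLOWUPs[.SD, on = .(Id, Date > Date), mult = "first",
                              x.Date - i.Date]]

# Plain numeric day counts; NA means no follow-up after leaving.
days <- as.numeric(Leaving_Date$n, units = "days")

# 1 if the first later follow-up is strictly inside the window, else 0.
Leaving_Date[, `:=`(
  SEVENDAY    = as.integer(!is.na(days) & days < 7),
  FOURTEENDAY = as.integer(!is.na(days) & days < 14),
  THIRTYDAY   = as.integer(!is.na(days) & days < 30))]
```

On the sample data this should reproduce the "Final data" table from the question, with rows that have no later follow-up coded as 0.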

Side note: please don't give tables names like this.

0

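I think this is what you are looking for, using dplyr.

It performs an inner join by Id, which generates all combinations of dates from the two data frames for a given Id. It then computes the date differences, groups by Id and Date, and checks whether any difference falls within the range of each of your three categories.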

library(dplyr) 

Leaving_Date2 <- Leaving_Date %>% inner_join(FOLLOWUPs %>% rename(FU_Date=Date)) %>% 
    mutate(datediff=as.numeric(FU_Date-Date)) %>% group_by(Id,Date) %>% 
    summarise(SEVENDAY=as.numeric(any(datediff %in% 0:6)), 
      FOURTEENDAY=as.numeric(any(datediff %in% 0:13)), 
      THIRTYDAY=as.numeric(any(datediff %in% 0:29))) 
+0

If you change the ranges in the `datediff %in%` statements to run from `0:n`, your final result matches the desired result. –

+0

Ah, I see what you are trying to do now! Modified above. Thanks for spotting that! –
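For comparison, the loop in the question uses strict inequalities (Date > leaving date and Date < leaving date + 7), which corresponds to day differences 1 to 6, 1 to 13, and 1 to 29. The following is a hedged variant of the dplyr approach under that reading, not the answerer's code. The names leaving and followups are hypothetical stand-ins for the question's tables, and left_join is used here (an assumption) so that leaving rows without any follow-up are kept and coded 0.

```r
library(dplyr)

# Hypothetical plain data frames holding the question's sample data.
leaving <- data.frame(
  Id   = c(1, 2, 3, 4, 3, 5),
  Date = as.Date(c("2017-01-01", "2017-02-03", "2017-01-01",
                   "2017-03-10", "2017-02-09", "2017-02-05")))
followups <- data.frame(
  Id   = c(1, 2, 3, 2, 2, 3, 3, 4, 1, 5),
  Date = as.Date(c("2016-10-01", "2017-02-04", "2017-01-17", "2017-02-23",
                   "2017-03-03", "2017-02-10", "2017-02-11", "2017-01-01",
                   "2017-01-15", "2017-01-01")))

result <- leaving %>%
  # Pair every leaving date with every follow-up of the same Id.
  left_join(rename(followups, FU_Date = Date), by = "Id") %>%
  mutate(datediff = as.numeric(FU_Date - Date)) %>%
  group_by(Id, Date) %>%
  # Strict windows: the follow-up must be after the leaving date.
  summarise(SEVENDAY    = as.numeric(any(datediff %in% 1:6)),
            FOURTEENDAY = as.numeric(any(datediff %in% 1:13)),
            THIRTYDAY   = as.numeric(any(datediff %in% 1:29)),
            .groups = "drop")
```

Note that summarise returns rows ordered by Id and Date, not in the original row order, and that recent dplyr versions may warn about a many-to-many join because Ids repeat in both tables.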

0

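We can do this as a query instead of a loop. First, I tidied up your data.tables, because the variable names confused me.

To make the comparison step easier, we first precompute the follow-up date limits for the 7-, 14-, and 30-day thresholds.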

library(dplyr) 

dt_leaving_neat = Leaving_Date %>% 
    mutate(.id = 1:n()) %>% 
    mutate(limit_07 = Date + 7) %>% 
    mutate(limit_14 = Date + 14) %>% 
    mutate(limit_30 = Date + 30) %>% 
    rename(id = .id, id_person = Id, leaving_date = Date) 

dt_follow_neat = FOLLOWUPs %>% 
    select(id_person = Id, followed_up_date = Date) 

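The actual operation is just a single query. It is written in dplyr for readability, but if speed is your main concern, you can translate it to data.table. I recommend running the pipeline one step at a time to make sure you understand what is happening.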

dt_followed_up = dt_leaving_neat %>% 
    tidyr::gather(follow_up, limit_date, limit_07:limit_30) %>% 
    left_join(dt_follow_neat, by = "id_person") %>% 
    mutate(followed_up = (followed_up_date > leaving_date) & (followed_up_date < limit_date)) %>% 
    select(id, id_person, leaving_date, follow_up, followed_up) %>% 
    filter(followed_up == TRUE) %>% 
    unique() %>% 
    tidyr::spread(follow_up, followed_up, fill = 0) %>% 
    select(id, id_person, leaving_date, limit_07, limit_14, limit_30) 

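The idea is to join the follow-up dates to the leaving dates and check whether each follow-up date falls within the threshold (and also after the leaving date, since presumably you cannot follow up before someone has left).

Some final cleanup then returns the format you wanted. You can also use select or rename to change the column names.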

dt_result = dt_leaving_neat %>% 
    select(id, id_person, leaving_date) %>% 
    left_join(dt_followed_up, by = c("id", "id_person", "leaving_date")) 

dt_result[is.na(dt_result)] = 0 

Result

> dt_result
  id id_person leaving_date limit_07 limit_14 limit_30
1  1         1   2017-01-01        0        0        1
2  2         2   2017-02-03        1        1        1
3  3         3   2017-01-01        0        0        1
4  4         4   2017-03-10        0        0        0
5  5         3   2017-02-09        1        1        1
6  6         5   2017-02-05        0        0        0
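As a small illustration of the renaming step mentioned above, select can rename columns while choosing them. The data frame below is a toy stand-in for the first two rows of dt_result, and the target names are taken from the question. This is only a sketch of one way to do it.

```r
library(dplyr)

# Toy stand-in for the first two rows of dt_result.
dt_result <- data.frame(
  id           = 1:2,
  id_person    = c(1, 2),
  leaving_date = as.Date(c("2017-01-01", "2017-02-03")),
  limit_07     = c(0, 1),
  limit_14     = c(0, 1),
  limit_30     = c(1, 1))

# select(new_name = old_name) keeps only the listed columns and renames them.
final <- dt_result %>%
  select(Id = id_person, Date = leaving_date,
         SEVENDAY = limit_07, FOURTEENDAY = limit_14, THIRTYDAY = limit_30)
```

The helper column id is dropped here because it is not listed in select.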

And following Andrew's answer, the equivalent one-line data.table solution is
