R: Aggregating historical records by ID and date

I have a large dataset with unique IDs for individuals, along with dates, and each individual can have multiple encounters.

Below is some code and an example of what the data might look like:

strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16", "11/8/16",  
"6/8/16", "5/8/16","2/3/16","1/1/16") 
Date<-as.Date(strDates, "%m/%d/%y") 
ID <- c("A", "A", "A", "A","A","B","B","B","B","B") 
Event <- c(1,0,1,0,1,0,1,1,1,0) 
sample_df <- data.frame(Date,ID,Event) 

sample_df 

     Date ID Event 
1 2016-09-09 A  1 
2 2016-06-07 A  0 
3 2016-05-06 A  1 
4 2016-02-03 A  0 
5 2016-02-01 A  1 
6 2016-11-08 B  0 
7 2016-06-08 B  1 
8 2016-05-08 B  1 
9 2016-02-03 B  1 
10 2016-01-01 B  0

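I want to keep all the information attached to each encounter, but then aggregate the following historical information by ID:

Number of previous encounters
Number of previous events

As an example, let's look at row 2.

Row 2 has ID A, so I would refer to rows 3-5 (which occurred before the row 2 encounter). Within this set of rows, we see that rows 3 and 5 both had events.

Number of previous encounters for row 2 = 3

Number of previous events for row 2 = 2

Ideally, I would get the following output: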

  Date ID Event PrevEnc PrevEvent 
1 2016-09-09 A  1  4   2 
2 2016-06-07 A  0  3   2 
3 2016-05-06 A  1  2   1 
4 2016-02-03 A  0  1   1 
5 2016-02-01 A  1  0   0 
6 2016-11-08 B  0  4   3 
7 2016-06-08 B  1  3   2 
8 2016-05-08 B  1  2   1 
9 2016-02-03 B  1  1   0 
10 2016-01-01 B  0  0   0
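
For reference, the definition above can be written out directly as a brute-force check in base R. This is only a sketch: the helper earlier_rows is a hypothetical name introduced here, it assumes dates are unique within each ID, and it compares every row against every other row, so it is meant for verifying results on a small sample rather than for a large dataset.

```r
# Rebuild the sample data from the question
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
              "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16")
sample_df <- data.frame(
  Date  = as.Date(strDates, "%m/%d/%y"),
  ID    = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Event = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0)
)

# Hypothetical helper: for row i, flag rows with the same ID and a strictly earlier date
earlier_rows <- function(i, df) {
  df$ID == df$ID[i] & df$Date < df$Date[i]
}

idx <- seq_len(nrow(sample_df))
brute <- sample_df
# Count the earlier rows, then sum Event over those same rows
brute$PrevEnc   <- sapply(idx, function(i) sum(earlier_rows(i, sample_df)))
brute$PrevEvent <- sapply(idx, function(i) sum(sample_df$Event[earlier_rows(i, sample_df)]))

print(brute)
```

On the ten sample rows this reproduces the PrevEnc and PrevEvent columns shown in the desired output.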

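So far I have tried to solve this in dplyr with mutate and summarise, but neither let me restrict the aggregation to events that occurred previously for the specific ID. I also tried some messy for loops with if-then statements, but I'm really just wondering whether there is a package or technique that simplifies this process.

Thanks!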

Source

2016-11-11 EntryLevelR

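The biggest obstacle is the current sort order. Here, I store the original row index, which I later use to put the data back in its original order (and then drop). Beyond that, the basic idea is to count encounters starting from 0, and to use cumsum to count the events that have occurred. lag is used so that the current event is not counted.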

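library(dplyr)

sample_df %>%
    mutate(origIndex = 1:n()) %>%        # remember the original row order
    group_by(ID) %>%
    arrange(ID, Date) %>%                # oldest encounter first within each ID
    mutate(PrevEncounters = 0:(n() - 1)  # encounters before the current one
         , PrevEvents = cumsum(lag(Event, default = 0))) %>%  # events before the current one
    arrange(origIndex) %>%               # restore the original row order
    select(-origIndex)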

gives

  Date  ID Event PrevEncounters PrevEvents 
     <date> <fctr> <dbl>   <int>  <dbl> 
1 2016-09-09  A  1    4   2 
2 2016-06-07  A  0    3   2 
3 2016-05-06  A  1    2   1 
4 2016-02-03  A  0    1   1 
5 2016-02-01  A  1    0   0 
6 2016-11-08  B  0    4   3 
7 2016-06-08  B  1    3   2 
8 2016-05-08  B  1    2   1 
9 2016-02-03  B  1    1   0 
10 2016-01-01  B  0    0   0
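
To see why lag followed by cumsum excludes the current row, it may help to trace the two steps on a single group. The sketch below uses the Event values for ID A, written oldest first as they are after arrange(ID, Date); the variable names are made up for illustration.

```r
library(dplyr)

# Events for ID "A", oldest encounter first (rows 5, 4, 3, 2, 1 of sample_df)
event_a <- c(1, 0, 1, 0, 1)

# Shift everything down one position so each row sees the previous row's event
shifted <- lag(event_a, default = 0)

# Running total of the shifted values: events strictly before each row
prev_events <- cumsum(shifted)

# Encounters strictly before each row: 0 for the first, 1 for the second, ...
prev_enc <- seq_along(event_a) - 1

print(data.frame(event_a, shifted, prev_events, prev_enc))
```

Reversing prev_events back to newest-first order gives 2, 2, 1, 1, 0, which matches the PrevEvents column for ID A above.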

Source

2016-11-11 16:19:30

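Isn't '0:(n() - 1)' the same as 'row_number() - 1L'? Also, I guess origIndex could be 'row_number()'. – Frank

Yes, @Frank - those should be equivalent. I don't know why I don't use 'row_number()' more often. It may just be a lazy habit. –

Thank you, this is a very helpful way to look at it! lag is def. something I didn't know about, and I'm glad to have it now! – EntryLevelR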

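As @Frank and @MarkPeterson pointed out, the biggest obstacle here is that the Date column is in descending order. Here is another approach that does not require re-sorting by the Date column: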

library(dplyr) 
res <- sample_df %>% group_by(ID) %>% 
        mutate(PrevEnc=n()-row_number(), 
          PrevEvent=rev(cumsum(lag(rev(Event), default=0))))

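Here, we use row_number() to get the row index and n() to get the number of rows (both within each ID group). Because Date is in descending order, the number of previous encounters is simply n()-row_number(). To compute the number of previous events, we again rely on the Date column being sorted in descending order: we use rev to reverse the Event column, apply lag and then cumsum to the reversed column, and finally use rev again to flip the result back to the original order.

Using your data: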

print(res) 
##Source: local data frame [10 x 5] 
##Groups: ID [2] 
## 
##   Date  ID Event PrevEnc PrevEvent 
##  <date> <fctr> <dbl> <int>  <dbl> 
##1 2016-09-09  A  1  4   2 
##2 2016-06-07  A  0  3   2 
##3 2016-05-06  A  1  2   1 
##4 2016-02-03  A  0  1   1 
##5 2016-02-01  A  1  0   0 
##6 2016-11-08  B  0  4   3 
##7 2016-06-08  B  1  3   2 
##8 2016-05-08  B  1  2   1 
##9 2016-02-03  B  1  1   0 
##10 2016-01-01  B  0  0   0
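
The rev, lag, cumsum, rev chain can also be traced one step at a time on a single group. The sketch below uses the Event values for ID B in their original newest-first order; the intermediate variable names are only for illustration. Like the answer itself, it assumes the rows within each ID are already sorted by descending date.

```r
library(dplyr)

# Events for ID "B", newest encounter first, as they appear in sample_df
event_b <- c(0, 1, 1, 1, 0)

# Step 1: reverse so the oldest encounter comes first
oldest_first <- rev(event_b)

# Step 2: shift by one and accumulate, giving events strictly before each row
before_each <- cumsum(lag(oldest_first, default = 0))

# Step 3: reverse again to line the counts up with the original row order
prev_event <- rev(before_each)

# Previous encounters: group size minus the position within the group
prev_enc <- length(event_b) - seq_along(event_b)

print(data.frame(event_b, prev_enc, prev_event))
```

The resulting prev_enc and prev_event values agree with rows 6 to 10 of the printed result.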

Source

2016-11-11 16:28:19 aichao

Alternatively, if you want to try data.table, you can use this:

library(data.table) 

# Convert to data.table and sort 
sample_dt <- as.data.table(sample_df) 
sample_dt <- sample_dt[order(Date)] 

# Count only the previous Events with 1 
sample_dt[, prevEvent := ifelse(Event == 1, cumsum(Event) - 1, cumsum(Event)), by = "ID"] 

# .SD contains the Subset of the Data for each group; inside .SD[...], .I is the row number within that subset 
sample_dt[, prevEnc := .SD[,.I - 1], by = "ID"] 

print(sample_dt) 
      Date ID Event prevEvent prevEnc 
1: 2016-01-01 B  0   0  0 
2: 2016-02-01 A  1   0  0 
3: 2016-02-03 A  0   1  1 
4: 2016-02-03 B  1   0  1 
5: 2016-05-06 A  1   1  2 
6: 2016-05-08 B  1   1  2 
7: 2016-06-07 A  0   2  3 
8: 2016-06-08 B  1   2  3 
9: 2016-09-09 A  1   2  4 
10: 2016-11-08 B  0   3  4

If you don't know this package, there is a nice cheat sheet that covers most of its operations.

Source

2016-11-11 17:07:36

Rather than calculating 'cumsum(Event)' twice, why not just 'cumsum(Event) - (Event == 1)'? – MichaelChirico

@MichaelChirico Good point. I hadn't thought of that. –
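
A quick way to check that the suggested shortcut matches the ifelse version is to compare them on one group in base R. The sketch below assumes Event is coded strictly as 0/1 and uses the ID A values, oldest first; the variable names are hypothetical.

```r
# Events for ID "A", oldest encounter first
event <- c(1, 0, 1, 0, 1)

# Version from the answer: subtract 1 only on rows where an event occurred
with_ifelse <- ifelse(event == 1, cumsum(event) - 1, cumsum(event))

# Suggested shortcut: the logical (event == 1) is coerced to 0/1 in the subtraction
with_logical <- cumsum(event) - (event == 1)

# Same idea as lag + cumsum: shift by one position, then accumulate
with_shift <- cumsum(c(0, head(event, -1)))

stopifnot(all(with_ifelse == with_logical), all(with_ifelse == with_shift))
print(with_logical)
```

All three expressions remove the current row's own event from the running total, so they agree whenever Event only takes the values 0 and 1.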
