2017-10-07

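Creating a binary variable based on a lag in R

I want to create a binary/indicator variable based on lagged observations. I have a variable X1. The raw data looks like the sample below; the full data set has close to 10K records.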

X1 
Diagnosis 
1 
2 
3 
4 
Treatment 
1 
2 
3 

The output I want looks like this:

X1   NewVar 
Diagnosis Diagnosis 
1   Diagnosis 
2   Diagnosis 
3   Diagnosis 
4   Diagnosis 
Treatment Treatment 
1   Treatment 
2   Treatment 
3   Treatment 

Any help would be greatly appreciated!


Show us what steps you have already taken to solve this problem. Show us your code and which specific part is causing the problem.

Answers


You can do this with cumsum. cumsum starts a new group each time Diagnosis or Treatment appears. Within each group, NewVar then takes the first X1 value of that group:

library(dplyr) 

dtf %>% 
    mutate(g = cumsum(X1 == 'Diagnosis' | X1 == 'Treatment')) %>% 
    group_by(g) %>% 
    mutate(NewVar = X1[1]) %>% 
    ungroup() %>% select(-g) 
# # A tibble: 9 x 2 
#   X1 NewVar 
# <fctr> <fctr> 
# 1 Diagnosis Diagnosis 
# 2   1 Diagnosis 
# 3   2 Diagnosis 
# 4   3 Diagnosis 
# 5   4 Diagnosis 
# 6 Treatment Treatment 
# 7   1 Treatment 
# 8   2 Treatment 
# 9   3 Treatment 

The dtf used in the code above:

> dput(dtf) 
structure(list(X1 = structure(c(5L, 1L, 2L, 3L, 4L, 6L, 1L, 2L, 
3L), .Label = c("1", "2", "3", "4", "Diagnosis", "Treatment"), class = "factor")), .Names = "X1", class = "data.frame", row.names = c(NA, 
-9L)) 
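
To make the grouping step easier to follow, here is a minimal base R sketch of the same idea that prints the intermediate group ids. It is an illustration, not the answer's own code: the vector names `x1`, `is_header`, `g` and `new_var` are made up for this example, the data is rebuilt as a character vector rather than the factor column in `dtf`, and `ave()` stands in for the dplyr `group_by()` / `mutate()` pair.

```r
# Sample data from the question, rebuilt as a character vector
# (assumption: the original `dtf` stores X1 as a factor instead).
x1 <- c("Diagnosis", "1", "2", "3", "4", "Treatment", "1", "2", "3")

# TRUE on the rows that start a new block
is_header <- x1 %in% c("Diagnosis", "Treatment")

# Running count of TRUE values: each row gets the id of its block
g <- cumsum(is_header)
print(g)
# [1] 1 1 1 1 1 2 2 2 2

# ave() applies the function within each group; here it fills every
# row of a block with the first value of that block
new_var <- ave(x1, g, FUN = function(v) v[1])
print(data.frame(X1 = x1, g = g, NewVar = new_var))
```

Because the comparison is FALSE on the numeric rows, the running sum only increases on a Diagnosis or Treatment row, which is what keeps all rows of one block in the same group.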

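Here is an option with data.table. After converting to a 'data.table' (setDT(dtf)), group by the cumulative sum of a logical vector that flags the values of 'X1' made up only of letters, and assign 'NewVar' as the first element of 'X1' (X1[1]) in each group: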

library(data.table) 
setDT(dtf)[, NewVar := X1[1], cumsum(grepl('^[A-Za-z]+$', X1))] 
dtf 
#   X1 NewVar 
#1: Diagnosis Diagnosis 
#2:   1 Diagnosis 
#3:   2 Diagnosis 
#4:   3 Diagnosis 
#5:   4 Diagnosis 
#6: Treatment Treatment 
#7:   1 Treatment 
#8:   2 Treatment 
#9:   3 Treatment
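
If neither package is available, the same letters-only test can be sketched in base R with plain indexing. This is an alternative illustration rather than part of the answer above: the helper names `x1` and `is_header` are hypothetical, and the sketch assumes the first row is always a header (otherwise `cumsum()` starts at 0 and the indexing drops rows).

```r
# Rebuild the sample data; X1 is a factor, as in the dput() output above
dtf <- data.frame(
  X1 = factor(c("Diagnosis", "1", "2", "3", "4", "Treatment", "1", "2", "3"))
)

# Work on the labels, not on the factor's integer codes
x1 <- as.character(dtf$X1)

# Rows whose value consists only of letters are treated as block headers
is_header <- grepl("^[A-Za-z]+$", x1)

# x1[is_header] holds the header labels in order of appearance;
# cumsum(is_header) says which of those labels applies to each row
dtf$NewVar <- x1[is_header][cumsum(is_header)]
print(dtf)
```

The regular expression is more general than comparing against the two known labels, but it also treats any other letters-only value as the start of a new block, so it is worth checking against the real 10K-row data.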