2016-07-22
2

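Fill blank rows depending on the previous/next non-blank value

I have a subscription data frame like the one shown below, with about 1 million unique IDs. The table lists the subscription status. When a user starts subscribing to the service, the Status field is marked 'Sub'; when the user cancels the subscription, it is marked 'Usub'.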

dat <- data.frame(ID=rep(c("A","B"), each=12), Year="2014", Month=rep(1:12,2), Status=NA) 
dat$Status[4] <- "Sub" 
dat$Status[8] <- "Usub" 
dat$Status[17] <- "Usub" 
dat$Status[21] <- "Sub" 

ID Year Month Status 
A 2014 1  
A 2014 2  
A 2014 3  
A 2014 4 Sub 
A 2014 5  
A 2014 6  
A 2014 7  
A 2014 8 Usub 
A 2014 9  
A 2014 10  
A 2014 11  
A 2014 12  
B 2014 1  
B 2014 2  
B 2014 3  
B 2014 4  
B 2014 5 Usub  
B 2014 6  
B 2014 7  
B 2014 8  
B 2014 9 Sub 
B 2014 10  
B 2014 11  
B 2014 12  
C 2014 1  . 
. . .  . 
. . .  . 

I am looking to fill the gaps between each status update. The desired output table would look like this:

ID Year Month Status 
A 2014 1 Usub 
A 2014 2 Usub 
A 2014 3 Usub 
A 2014 4 Sub 
A 2014 5 Sub 
A 2014 6 Sub 
A 2014 7 Sub 
A 2014 8 Usub 
A 2014 9 Usub 
A 2014 10 Usub 
A 2014 11 Usub 
A 2014 12 Usub 
B 2014 1 Sub 
B 2014 2 Sub 
B 2014 3 Sub 
B 2014 4 Sub 
B 2014 5 Usub 
B 2014 6 Usub 
B 2014 7 Usub 
B 2014 8 Usub 
B 2014 9 Sub 
B 2014 10 Sub 
B 2014 11 Sub 
B 2014 12 Sub 
C 2014 1  . 
. . .  . 
. . .  . 

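Each ID has at least one status value. If the first status record for an ID is 'Usub', then the status for all earlier months is 'Sub' (as for ID B at 2014/05). Conversely, if the first status record is 'Sub', the status for all earlier months is 'Usub'.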

@MrFlick Last observation carried forward would not work for the first 3 rows of ID = A or the first 4 rows of ID = B. – ohmyan

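@MrFlick The subscription data is incomplete, which means the first non-empty status may not be 'Sub'. It could be 'Usub', in which case all the preceding rows are actually 'Sub'. – ohmyan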

@MrFlick As stated in the post, each ID has at least one status value. – ohmyan

Answers

3

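You can generate an alternating sequence equivalent to the status you want by subtracting Status == "Sub" from Status == "Usub" and taking the cumulative sum within each ID. This way, every position that should be filled with Sub has a lower value than the positions that should be filled with Usub. Because factor labels are assigned in sorted order of the values, you can then use factor to convert the integer sequence into a factor: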

library(dplyr) 
df %>% group_by(ID) %>% mutate(Status = factor(cumsum((Status == "Usub") - (Status == "Sub")), 
               labels = c("Sub", "Usub"))) 

# ID Year Month Status 
# 1 A 2014  1 Usub 
# 2 A 2014  2 Usub 
# 3 A 2014  3 Usub 
# 4 A 2014  4 Sub 
# 5 A 2014  5 Sub 
# 6 A 2014  6 Sub 
# 7 A 2014  7 Sub 
# 8 A 2014  8 Usub 
# 9 A 2014  9 Usub 
# 10 A 2014 10 Usub 
# 11 A 2014 11 Usub 
# 12 A 2014 12 Usub 
# 13 B 2014  1 Sub 
# 14 B 2014  2 Sub 
# 15 B 2014  3 Sub 
# 16 B 2014  4 Sub 
# 17 B 2014  5 Usub 
# 18 B 2014  6 Usub 
# 19 B 2014  7 Usub 
# 20 B 2014  8 Usub 
# 21 B 2014  9 Sub 
# 22 B 2014 10 Sub 
# 23 B 2014 11 Sub 
# 24 B 2014 12 Sub 

The corresponding data.table approach would be:

library(data.table) 
setDT(df)[, Status := as.character(factor(cumsum((Status == "Usub") - (Status == "Sub")), labels = c("Sub", "Usub"))), .(ID)] 

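You have to convert the new factor back to character class, because data.table does not allow the column's type to change when the new column is created.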
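Both snippets above compare Status against strings, so they expect blank rows to hold "" rather than NA. The dat built in the question starts from NA, and a comparison such as Status == "Usub" returns NA for those rows, which cumsum would then propagate. A minimal sketch of the conversion, assuming the column layout of the question's dat (the rebuilt example data here is only for illustration):

```r
# Rebuild the question's example data; Status starts out as NA
dat <- data.frame(ID = rep(c("A", "B"), each = 12), Year = "2014",
                  Month = rep(1:12, 2), Status = NA_character_,
                  stringsAsFactors = FALSE)
dat$Status[c(4, 21)] <- "Sub"
dat$Status[c(8, 17)] <- "Usub"

# Replace NA with "" so that Status == "Usub" yields TRUE/FALSE, never NA
dat$Status[is.na(dat$Status)] <- ""
table(dat$Status)
```

After this step the cumsum-based code can be run on dat in place of df.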

Data, assuming you have empty strings rather than NA:

df <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B"), Year = c("2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014", "2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014", "2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014"), Month = c("1", "2", "3", "4", "5", "6", 
"7", "8", "9", "10", "11", "12", "1", "2", "3", "4", "5", "6", 
"7", "8", "9", "10", "11", "12"), Status = c("", "", "", "Sub", 
"", "", "", "Usub", "", "", "", "", "", "", "", "", "Usub", "", 
"", "", "Sub", "", "", "")), .Names = c("ID", "Year", "Month", 
"Status"), row.names = c(NA, 24L), class = "data.frame") 

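This is neat! It outputs exactly what I was looking for. However, I still have trouble understanding this operation: (Status == "Usub") - (Status == "Sub"). I am not entirely sure how it works. Would you mind giving more detail? Thanks! – ohmyan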

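This assumes that 'Usub' and 'Sub' always alternate. By doing the subtraction, you get 1 for every 'Usub' and -1 for every 'Sub', and the cumsum of that sequence gives an alternating sequence of either 0 and 1 or 0 and -1, depending on which of 'Usub' and 'Sub' comes first. The subtraction also ensures that all empty strings after a Sub take the lower value: 0 in the former case and -1 in the latter. Then, when you build a factor from it, you know that 'Sub' should be the first label, because it corresponds to the lower value as explained above. – Psidom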
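To make the explanation above concrete, here is a small sketch that prints the intermediate values for the two example IDs. The vector names (a, b, step_a, run_a, run_b) are made up for illustration and do not appear in the answer.

```r
# Status for ID A ("Sub" comes first) and ID B ("Usub" comes first), blanks as ""
a <- c("", "", "", "Sub", "", "", "", "Usub", "", "", "", "")
b <- c("", "", "", "", "Usub", "", "", "", "Sub", "", "", "")

step_a <- (a == "Usub") - (a == "Sub")          # -1 at "Sub", +1 at "Usub", 0 elsewhere
run_a  <- cumsum(step_a)                        # -1 while subscribed, 0 otherwise
run_b  <- cumsum((b == "Usub") - (b == "Sub"))  # 0 while subscribed, 1 otherwise

# factor() sorts the distinct values, so the lower value gets the first label
filled_a <- as.character(factor(run_a, labels = c("Sub", "Usub")))
filled_b <- as.character(factor(run_b, labels = c("Sub", "Usub")))
print(rbind(run_a, run_b))
```

In both cases the lower of the two distinct values marks the subscribed months, which is why "Sub" is listed first in labels.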

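0

uniquevector <- unique(dat$ID)
for (i in uniquevector) {
    # Row indices for this ID; assumes exactly one "Sub" and one "Usub" row per ID
    zzz  <- which(dat$ID == i & dat$Status == "Sub")
    zzz2 <- which(dat$ID == i & dat$Status == "Usub")
    zzz3 <- which(dat$ID == i & dat$Month == 12)   # last row of this ID
    zzz4 <- which(dat$ID == i & dat$Month == 1)    # first row of this ID
    # Fill the gap between the two status changes
    if (zzz2 > zzz) {
        index <- zzz:(zzz2 - 1)
        dat$Status[index] <- "Sub"
    }
    if (zzz2 < zzz) {
        index <- zzz2:(zzz - 1)
        dat$Status[index] <- "Usub"
    }
    # Fill from the later status change through month 12
    if (zzz3 > zzz2 && zzz < zzz2) {
        index <- zzz2:zzz3
        dat$Status[index] <- "Usub"
    }
    if (zzz2 < zzz && zzz3 > zzz) {
        index <- zzz:zzz3
        dat$Status[index] <- "Sub"
    }
    # Fill the months before the first status change with the opposite status
    if (zzz4 < zzz && zzz < zzz2) {
        index <- zzz4:(zzz - 1)
        dat$Status[index] <- "Usub"
    }
    if (zzz4 < zzz2 && zzz2 < zzz) {
        index <- zzz4:(zzz2 - 1)
        dat$Status[index] <- "Sub"
    }
}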
0

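Another option is to convert the blanks "" to NA and use na.locf from the zoo package to replace each NA with the previous non-NA element. Since this is a group-wise operation, we can also do it with ave from base R. Because there is no last observation to carry forward for the leading rows of each ID, those remaining NAs are replaced with the other status value.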

library(zoo) 
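df$Status <- with(df, ave(replace(Status, !nzchar(Status), NA), ID, 
      FUN = function(x) {
        # carry the last non-NA status forward; leading NAs are kept
        x1 <- na.locf(x, na.rm = FALSE)
        # fill the leading NAs with the status that differs from the first one observed
        replace(x1, is.na(x1), setdiff(unique(na.omit(x1)), x1[!is.na(x1)][1]))
      }))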
df$Status 
#[1] "Usub" "Usub" "Usub" "Sub" "Sub" "Sub" "Sub" "Usub" "Usub" "Usub" "Usub" "Usub" "Sub" "Sub" "Sub" "Sub" "Usub" "Usub" "Usub" 
#[20] "Usub" "Sub" "Sub" "Sub" "Sub"
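For readers without zoo installed, the two steps can be sketched in base R for a single ID. The locf helper below is hypothetical (it is not part of zoo or base R). The leading rows are filled by the rule stated in the question rather than by the setdiff call used above, so this is a swapped-in variant and not the answer's exact method.

```r
# Hypothetical helper: carry the last non-NA value forward (base R only)
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # position of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]    # idx 0 (no earlier value) maps to NA
}

x  <- c(NA, NA, NA, "Sub", NA, NA, NA, "Usub", NA, NA, NA, NA)
x1 <- locf(x)

# The leading NAs remain; fill them with the status opposite to the first one seen
first <- x1[!is.na(x1)][1]
x1[is.na(x1)] <- if (first == "Sub") "Usub" else "Sub"
x1
```

Applied per ID (for example through ave), this should also cover an ID that has only a single status record, a case where setdiff would return an empty vector.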