Sample n consecutive dates from a random start date for each index of a data frame
Consider the following data frame:
DF = structure(list(c_number = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L), date = c("2001-01-06", "2001-01-07", "2001-01-08",
"2001-01-09", "2001-01-10", "2001-01-11", "2001-01-12", "2001-01-13",
"2001-01-14", "2001-01-15", "2001-01-16", "2001-01-17", "2001-01-18",
"2001-01-19", "2001-01-20", "2001-01-21", "2001-01-22", "2001-01-23",
"2001-01-24", "2001-01-25", "2001-01-26", "2001-01-11", "2001-01-12",
"2001-01-13", "2001-01-14", "2001-01-15", "2001-01-16", "2001-01-17",
"2001-01-18", "2001-01-19", "2001-01-20", "2001-01-21", "2001-01-22",
"2001-01-23", "2001-01-24", "2001-01-25", "2001-01-26", "2001-01-27",
"2001-01-28", "2001-01-12", "2001-01-13", "2001-01-14", "2001-01-15",
"2001-01-16", "2001-01-17", "2001-01-18", "2001-01-19", "2001-01-20",
"2001-01-21", "2001-01-22", "2001-01-23", "2001-01-24", "2001-01-25",
"2001-01-26", "2001-01-27", "2001-01-28", "2001-01-29", "2001-01-30",
"2001-01-21", "2001-01-22", "2001-01-23", "2001-01-24", "2001-01-25",
"2001-01-26", "2001-01-27", "2001-01-28", "2001-01-29", "2001-01-30",
"2001-01-31", "2001-01-24", "2001-01-25", "2001-01-26", "2001-01-27",
"2001-01-28", "2001-01-29", "2001-01-30", "2001-01-31", "2001-02-01"
), value = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("c_number",
"date", "value"), row.names = c(NA, -78L), class = "data.frame")
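I have sales data on consecutive dates for 5 customers: for customer 1 I have 21 consecutive days of sales data, and so on down to customer 5, for whom I have 9 consecutive days of sales data: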
> table(DF[, 1])
1 2 3 4 5
21 18 19 11 9
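For each customer I want to sample a sub-DF of 15 consecutive dates (if that customer has at least 15 consecutive dates), or all of that customer's dates (if I do not have 15 consecutive dates for that customer).
The key part is that in case 1 (when I have at least 15 consecutive dates) those 15 consecutive days should have a random start date (e.g. not always the customer's first or last 15 dates), to avoid introducing bias into the analysis.
In plain R I would do: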
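library(dplyr)
slow_function <- function(i, DF, length_out = 15){
  # rows for customer i (assumed to be already ordered by date)
  sub_DF = DF[DF$c_number == i, ]
  if(nrow(sub_DF) <= length_out){
    # not enough rows for a full window: keep everything
    out_DF = sub_DF
  } else {
    # valid starts are 1 .. nrow - length_out + 1, so the last window can be drawn too
    random_start = sample.int(nrow(sub_DF) - length_out + 1, 1)
    out_DF = sub_DF[random_start:(random_start + length_out - 1), ]
  }
  out_DF
}
# loop over the customers actually present in DF
a_out = lapply(unique(DF$c_number), slow_function, DF = DF, length_out = 15)
a_out = dplyr::bind_rows(a_out)
table(a_out[, 1])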
1 2 3 4 5
15 15 15 11 9
But my data are much larger, and the operation above is unbearably slow. Is there a faster way to get the same result in data.table/dplyr?
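One possible direction, as a minimal sketch under assumptions rather than a benchmarked answer: with data.table, the window can be chosen per customer on the row numbers held in `.I`, and the table subset once at the end. The helper name `sample_window` and the toy table `toy` are made up for illustration, and the rows are assumed to be sorted by date within each customer.

```r
library(data.table)

# Hypothetical helper: positions (within a group of n rows) of a randomly
# placed window of k consecutive rows, or all positions if n <= k.
sample_window <- function(n, k = 15L) {
  if (n <= k) return(seq_len(n))
  start <- sample.int(n - k + 1L, 1L)   # valid starts are 1 .. n - k + 1
  seq.int(start, length.out = k)
}

# Toy data (illustrative only): customers with 21, 18 and 9 consecutive days.
set.seed(1)
toy <- data.table(
  c_number = rep(1:3, times = c(21L, 18L, 9L)),
  date     = c(as.Date("2001-01-06") + 0:20,
               as.Date("2001-01-11") + 0:17,
               as.Date("2001-01-24") + 0:8),
  value    = 1
)

# .I holds the global row numbers of each group; keep one window per customer.
idx <- toy[, .(row = .I[sample_window(.N)]), by = c_number]$row
out <- toy[idx]

# Rows kept per customer.
out[, .N, by = c_number]
```

Working on row numbers means each customer's rows are never copied into a separate data frame, which is where the `lapply` version likely spends much of its time.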
Edit: code to generate the data.
num_customer = 10
m = 2 * num_customer
a_0 = seq(as.Date("2001-01-01"), as.Date("2001-12-31"), by = "day")
a_1 = matrix(sort(sample(as.character(a_0), m)), ncol = 2)
a_2 = list()
for(i in 1:nrow(a_1)){
a_3 = seq(as.Date(a_1[i, 1]), as.Date(a_1[i, 2]), by = "day")
# note: runif(n, 1) sets min = 1 (max defaults to 1), so value is always 1
a_4 = data.frame(i, as.character(a_3), round(runif(length(a_3), 1)))
colnames(a_4) = c("c_number", "date", "value")
a_2[[i]] = a_4
}
DF = dplyr::bind_rows(a_2)
dim(DF)
table(DF[, 1])
dput(DF)
EDIT2:
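On a DF with 100k customers, Christoph Wolk's solution is the fastest. Next comes G. Grothendieck's (taking about 4 times as long), followed by Nathan Werth's (another 2 times slower than G. Grothendieck's). The other solutions are noticeably slower. Still, all the proposals are faster than my tentative 'slow' function, so thanks everyone!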
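The question is a bit unclear. For each employee, do you want to pick a random start date and sample up to 15 consecutive days after that starting point? Or, if the random pick would leave fewer than 15 data points for the employee, take the last 15 data points? – jdobres
@jdobres: Thanks for asking. Indeed, the second interpretation ('if the random pick would leave fewer than 15 data points for the employee, just take the last 15') is what I want. – user189035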
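To make the second interpretation concrete, here is a small base R sketch. It assumes the start may be drawn from any of the customer's rows and is pulled back whenever fewer than 15 rows remain after it; the helper name `window_from_any_start` is hypothetical.

```r
# Hypothetical helper: draw a start among all n rows; if fewer than k rows
# remain from that start, fall back to the last k rows.
window_from_any_start <- function(n, k = 15L) {
  if (n <= k) return(seq_len(n))      # fewer than k rows: keep them all
  start <- sample.int(n, 1L)          # any row may be drawn as the start
  start <- min(start, n - k + 1L)     # too close to the end: use the last k rows
  seq.int(start, length.out = k)
}

# Start positions over many draws for a customer with 21 rows.
set.seed(42)
starts <- replicate(2000, window_from_any_start(21L)[1])
table(starts)
```

Under this reading the last window is drawn more often than the others (every start that falls in the final 15 rows maps to it), so if uniform coverage matters, drawing the start from `1:(n - k + 1)` instead may be preferable.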