Creating a summary table of user event data

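Edit 2: I realized that I can use dcast() to do what I want. However, I don't want to count all events in the event data, only those that occurred before a date specified in another data set. I can't seem to figure out how to use the subset argument of dcast(). So far I have tried: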

dcast(dt.events, Email ~ EventType, fun.aggregate = length, subset = as.Date(Date) <= 
as.Date(dt.users$CreatedDate[dt.users$Email = dt.events$Email]))

But this doesn't work. I could add the CreatedDate column from dt.users to dt.events and then subset using:

dcast(dt.events, Email ~ EventType, fun.aggregate = length, 
      subset = .(as.Date(Date) <= as.Date(CreatedDate)))
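As a rough illustration of this add-a-column route, here is a minimal sketch on invented toy data (the e-mail addresses and dates are hypothetical, not taken from the real data set). It assumes dt.events and dt.users are data.tables whose date columns are already of class Date, and it relies on dcast() expecting the subset condition to be wrapped in .().

```r
library(data.table)

# Hypothetical toy data; addresses and dates are invented for illustration.
dt.events <- data.table(
  Email     = c("a@example.com", "a@example.com", "b@example.com", "b@example.com"),
  EventType = c("Type1", "Type2", "Type1", "Type2"),
  Date      = as.Date(c("2016-01-02", "2016-01-04", "2016-01-02", "2016-01-04"))
)
dt.users <- data.table(
  Email       = c("a@example.com", "b@example.com"),
  CreatedDate = as.Date(c("2016-01-03", "2016-01-05"))
)

# Copy CreatedDate into the event rows with an update join (modifies dt.events by reference).
dt.events[dt.users, on = "Email", CreatedDate := i.CreatedDate]

# Count events per Email and EventType, keeping only rows dated on or before CreatedDate.
# The condition is wrapped in .() because dcast() unwraps the subset argument.
counts <- dcast(dt.events, Email ~ EventType,
                fun.aggregate = length, value.var = "EventType",
                subset = .(Date <= CreatedDate))
print(counts)
```

The update join avoids a separate merge() call, but it still adds the extra column to dt.events.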

I'm wondering whether it is possible to do this without having to add the extra column?

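Edit: I just calculated that what I'm currently doing would take about 37 hours to finish, so if anyone has any tips for making it faster, please let me know :)

I'm new to R, and I have come up with a way to do what I want. It is extremely inefficient, though, and takes hours to complete.

I have the following:

Event data: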

UserID Email   EventType Date 

User1  [email protected]*.com Type2  2016-01-02 
User1  [email protected]*.com Type6  2016-01-02 
User1  [email protected]*.com Type1  2016-01-02 
User1  [email protected]*.com Type3  2016-01-02 
User2  [email protected]*.com Type1  2016-01-02 
User2  [email protected]*.com Type1  2016-01-02 
User2  [email protected]*.com Type2  2016-01-02 
User3  [email protected]*.com Type1  2016-01-02 
User3  [email protected]*.com Type3  2016-01-02 
User1  [email protected]*.com Type2  2016-01-04 
User1  [email protected]*.com Type2  2016-01-04 
User2  [email protected]*.com Type5  2016-01-04 
User3  [email protected]*.com Type1  2016-01-04 
User3  [email protected]*.com Type4  2016-01-04

Every time a user does something, an event is recorded with an event type and a timestamp.

User list (from a different database):

UserID Email   CreatedDate 

DxUs1  [email protected]*.com 2016-01-01 
DxUs2  [email protected]*.com 2016-01-03 
DxUs3  [email protected]*.com 2016-01-03

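I would like to get the following:

A summary list that counts, for each user in the user list, the number of events of each event type in the event data. However, an event should only be counted if the "CreatedDate" in the user list is on or before the "Date" in the event data.

So, for the data above, I would ultimately want: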

Email   Type1 Type2 Type3 Type4  Type5  Type6 
[email protected]*.com 1  3  1  0   0   1 
[email protected]*.com 0  0  1  0   1   0 
[email protected]*.com 1  0  0  1   0   0

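How I have managed to do this so far

I have been able to do this by first creating a 'dt.master' data.table that includes a column for every event type and the list of emails. It looks like this: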

Email   Type1 Type2 Type3 Type4  Type5  Type6 
[email protected]*.com 0  0  0  0   0   0 
[email protected]*.com 0  0  0  0   0   0 
[email protected]*.com 0  0  0  0   0   0

I then fill in this table with the following while loop:

# The data sets 
dt.events # event data 
dt.users # user list 
dt.master # blank master table 

# Loop that fills master table 
counter_limit = group_size(dt.master) 
index = 1 

while (index <= counter_limit) { 

    # Get events of user at current index 
    dt.events.temp = filter(dt.events, dt.events$Email %in% dt.users$Email[index], 
        as.Date(dt.events$Date) <= as.Date(dt.users$CreatedDate[index])) 

    # Count all the different events 
    dt.event.counter = as.data.table(t(as.data.table(table(dt.events.temp$EventType)))) 

    # Clean the counter by 1: Rename columns to event names, 2: Remove event names row 
    names(dt.event.counter) = as.character(unlist(dt.event.counter[1,])) 
    dt.event.counter = dt.event.counter[-1] 

    # Fill the current index in on the blank master table 
    set(dt.master, index, names(dt.event.counter), dt.event.counter) 

    index = index + 1 
}

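The problem

This works... However, I am dealing with more than 9 million events, 250+ users, and 150+ event types. As a result, the while loop above takes HOURS to finish. I tested it with a small batch of 500 users, and it had the following processing times: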

user system elapsed 
179.33 62.92  242.60

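I'm still waiting for the full batch to finish, haha. I have read somewhere that loops should be avoided because they take a long time. However, I'm brand new to R and to programming in general, and I have been learning whatever I need through trial and error and googling. Obviously, that leads to some messy code. Could anyone point me in a direction that might be faster or more efficient?

Thanks!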

TL;DR: My event summary/aggregation code takes hours to process my data (it still hasn't finished). Is there a faster way to do this?

Source

2017-02-10 Mark

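You should look into long/wide format: '?reshape()' – BigDataScientist

I rolled back the edit because solutions belong in answers; hope you don't mind – Jaap

Also: congratulations on your first question! Well formulated, and therefore an example for all new users imo. – Jaap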

Assuming your data is already a data.table, you can use the fun.aggregate argument of dcast:

dcast(dat, Email ~ EventType, fun.aggregate = length)

which gives:

  Email Type1 Type2 Type3 Type4 Type5 Type6 
1: [email protected]*.com  1  2  1  0  0  1 
2: [email protected]*.com  4  1  0  0  1  0 
3: [email protected]*.com  0  1  1  1  0  0
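To make the call above reproducible, here is a minimal sketch on invented toy data (the e-mail addresses are hypothetical). The explicit value.var is an addition that only silences the message dcast() prints when it has to guess the value column.

```r
library(data.table)

# Hypothetical toy data; the e-mail addresses are invented for illustration.
dat <- data.table(
  Email     = c("a@example.com", "a@example.com", "a@example.com", "b@example.com"),
  EventType = c("Type1", "Type2", "Type2", "Type1")
)

# One row per Email and one column per EventType; each cell holds the number
# of matching rows, and combinations that never occur are filled with 0.
wide <- dcast(dat, Email ~ EventType,
              fun.aggregate = length, value.var = "EventType")
print(wide)
```

Because the whole table is aggregated in one call, no per-user loop is needed.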

In response to the comments and the updated question: you can get the desired result by using a non-equi join inside the dcast function:

dcast(dt.events[dt.users, on = .(Email, Date >= CreatedDate)], 
     Email ~ EventType, fun.aggregate = length)

which gives:

  Email Type1 Type2 Type3 Type4 Type5 Type6 
1: [email protected]*.com  1  2  1  0  0  1 
2: [email protected]*.com  1  0  0  0  1  0 
3: [email protected]*.com  0  1  0  1  0  0
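A self-contained sketch of the same non-equi join on invented toy data may help when adapting it (addresses and dates are hypothetical). Two things here are additions rather than part of the call above: nomatch = 0L, which drops users without any qualifying event so they do not produce an NA column, and the explicit value.var.

```r
library(data.table)

# Hypothetical toy data; addresses and dates are invented for illustration.
dt.events <- data.table(
  Email     = c("a@example.com", "a@example.com", "b@example.com", "b@example.com"),
  EventType = c("Type1", "Type2", "Type1", "Type2"),
  Date      = as.Date(c("2016-01-02", "2016-01-04", "2016-01-02", "2016-01-04"))
)
dt.users <- data.table(
  Email       = c("a@example.com", "b@example.com"),
  CreatedDate = as.Date(c("2016-01-01", "2016-01-03"))
)

# Non-equi join: for every user, keep the events with the same Email whose
# Date is on or after that user's CreatedDate.
joined <- dt.events[dt.users, on = .(Email, Date >= CreatedDate), nomatch = 0L]

# Reshape the filtered events to one row per Email with a count per EventType.
counts <- dcast(joined, Email ~ EventType,
                fun.aggregate = length, value.var = "EventType")
print(counts)
```

To count events on or before CreatedDate instead, flip the condition to Date <= CreatedDate. As far as I know, the Date column of the joined table carries the value from dt.users rather than the original event date, so copy the event date into a second column before joining if you need it afterwards.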

Source

2017-02-10 13:51:30 Jaap

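This is what I was looking for. Thanks! But I still need to make sure that not all events are counted, only those before the date in the other table. I see that dcast has a subset argument, so I'll try using that! – Mark

Hi Jaap, I want to use dcast, but also subset using the other data table. I tried: 'dcast(dt.events, Email ~ EventType, fun.aggregate = length, subset = as.Date(Date) <= as.Date(dt.users$CreatedDate[dt.users$Email = dt…' and got the error '…(eval(subset, data, parent.frame())): argument to 'which' is not logical'. I assume that is because the subset logic is incorrect. Is there another way to subset it, or should I first add CreatedDate to dt.events and then use that for the subset? – Mark

@Mark see the update to my answer – Jaap

Untested: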

library(dplyr) 
library(tidyr) 
your.dataset %>% 
    count(Email, EventType) %>% 
    spread(EventType, n)
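The pipeline above counts every event. A possible extension that also applies the date condition is sketched below on invented toy data (addresses and dates are hypothetical). The inner_join(), the filter() step and fill = 0L are assumptions added here and are not part of the untested snippet; the comparison direction follows the non-equi join in the other answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; addresses and dates are invented for illustration.
events <- data.frame(
  Email     = c("a@example.com", "a@example.com", "b@example.com", "b@example.com"),
  EventType = c("Type1", "Type2", "Type1", "Type2"),
  Date      = as.Date(c("2016-01-02", "2016-01-04", "2016-01-02", "2016-01-04"))
)
users <- data.frame(
  Email       = c("a@example.com", "b@example.com"),
  CreatedDate = as.Date(c("2016-01-01", "2016-01-03"))
)

counts <- events %>%
  inner_join(users, by = "Email") %>%    # attach each user's CreatedDate
  filter(Date >= CreatedDate) %>%        # keep only the qualifying events
  count(Email, EventType) %>%            # one row per Email/EventType with count n
  spread(EventType, n, fill = 0L)        # wide format, missing combinations as 0
print(counts)
```

Unlike the non-equi join, this materializes the joined table before filtering, which may matter with millions of events.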

Source

2017-02-10 13:48:47 Thierry
