Python pandas: incorrect transaction count by date

I am working with the following pandas DataFrame `df`:

Customer_ID | Transaction_ID | Item_ID 
ABC   2017-04-12-333 X8973 
ABC   2017-04-12-333 X2468 
ABC   2017-05-22-658 X2906 
ABC   2017-05-22-757 X8790 
ABC   2017-07-13-864 X8790  
BCD   2017-08-11-879 X2346 
BCD   2017-08-11-879 X2468

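I want to add a column that numbers each customer's transactions: first transaction, second transaction, and so on, ordered by date. (If two transactions fall on the same day, I give them the same number. I have no time-of-day information, so I can't tell which one came first, and I basically treat them as a single transaction.)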

#get the date out of the Transaction_ID string 
df['date'] = pd.to_datetime(df.Transaction_ID.str[:10]) 

#calculate the transaction number 
df['trans_nr'] = df.groupby(['Customer_ID',"Transaction_ID", df['date'].dt.year]).cumcount()+1

Unfortunately, this is the output I get with the code above:

Customer_ID | Transaction_ID | Item_ID | date  | trans_nr 
ABC   2017-04-12-333 X8973  2017-04-12  1 
ABC   2017-04-12-333 X2468  2017-04-12  2 
ABC   2017-05-22-658 X2906  2017-05-22  1 
ABC   2017-05-22-757 X8790  2017-05-22  1 
ABC   2017-07-13-864 X8790  2017-07-13  1 
BCD   2017-08-11-879 X2346  2017-08-11  1 
BCD   2017-08-11-879 X2468  2017-08-11  2

This is incorrect. This is the output I am looking for:

Customer_ID | Transaction_ID | Item_ID | date  | trans_nr 
ABC   2017-04-12-333 X8973  2017-04-12  1 
ABC   2017-04-12-333 X2468  2017-04-12  1 
ABC   2017-05-22-658 X2906  2017-05-22  2 
ABC   2017-05-22-757 X8790  2017-05-22  2 
ABC   2017-07-13-864 X8790  2017-07-13  3 
BCD   2017-08-11-879 X2346  2017-08-11  1 
BCD   2017-08-11-879 X2468  2017-08-11  1

Maybe the logic should be based only on Customer_ID and date (without Transaction_ID)?

I tried this:

df['trans_nr'] = df.groupby(['Customer_ID', 'date']).cumcount()+1

But that also counts incorrectly.
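For reference, here is a minimal sketch that rebuilds the sample data and runs both attempts. The variable names `by_transaction` and `by_date` are made up for illustration. It shows that `cumcount` numbers the rows inside each group, so rows that share a date never share a number.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Customer_ID": ["ABC"] * 5 + ["BCD"] * 2,
    "Transaction_ID": [
        "2017-04-12-333", "2017-04-12-333", "2017-05-22-658",
        "2017-05-22-757", "2017-07-13-864", "2017-08-11-879",
        "2017-08-11-879",
    ],
    "Item_ID": ["X8973", "X2468", "X2906", "X8790", "X8790", "X2346", "X2468"],
})
df["date"] = pd.to_datetime(df["Transaction_ID"].str[:10])

# First attempt: rows are numbered within each (customer, transaction, year) group
by_transaction = (
    df.groupby(["Customer_ID", "Transaction_ID", df["date"].dt.year]).cumcount() + 1
)

# Second attempt: rows are numbered within each (customer, date) group
by_date = df.groupby(["Customer_ID", "date"]).cumcount() + 1

print(by_transaction.tolist())  # [1, 2, 1, 1, 1, 1, 2]
print(by_date.tolist())         # [1, 2, 1, 2, 1, 1, 2]
```

Neither result matches the wanted [1, 1, 2, 2, 3, 1, 1], because what needs numbering is the distinct dates per customer, not the rows per group.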

Source

2017-10-17 jeangelj

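Can you explain why trans_nr = 1 for the second record? When I run your code, trans_nr for the second record is 2. I get [1 2 1 1 1 1 2], not [1 1 1 2 2 1 2] –

Sorry, I was experimenting with the counts and pasted the wrong result. What I need is [1 1 2 2 3 1 1], though – jeangelj

Why is the second record 1? What is different between the first two records? I only see that Item_ID differs. –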

Let's try:

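# Within each customer and year, start a new number whenever the date changes
# (relies on the rows already being sorted by date within each customer)
df['trans_nr'] = df.groupby(['Customer_ID', df['date'].dt.year])['date'] \
        .transform(lambda x: (x.diff() != pd.Timedelta('0 days')).cumsum())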

Output:

Customer_ID Transaction_ID Item_ID  date trans_nr 
0   ABC 2017-04-12-333 X8973 2017-04-12   1 
1   ABC 2017-04-12-333 X2468 2017-04-12   1 
2   ABC 2017-05-22-658 X2906 2017-05-22   2 
3   ABC 2017-05-22-757 X8790 2017-05-22   2 
4   ABC 2017-07-13-864 X8790 2017-07-13   3 
5   BCD 2017-08-11-879 X2346 2017-08-11   1 
6   BCD 2017-08-11-879 X2468 2017-08-11   1
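The same "count the date changes" idea can also be sketched without a lambda. This is a variation, not the code from the answer: it compares each date with the previous row's date via `shift()`, numbers across years (add `df["date"].dt.year` to the groupby keys if the count should restart each year), and the helper names `prev_date` and `is_new_date` are invented for illustration. It sorts explicitly, since comparing with the previous row only makes sense on date-sorted data.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Customer_ID": ["ABC"] * 5 + ["BCD"] * 2,
    "Transaction_ID": [
        "2017-04-12-333", "2017-04-12-333", "2017-05-22-658",
        "2017-05-22-757", "2017-07-13-864", "2017-08-11-879",
        "2017-08-11-879",
    ],
    "Item_ID": ["X8973", "X2468", "X2906", "X8790", "X8790", "X2346", "X2468"],
})
df["date"] = pd.to_datetime(df["Transaction_ID"].str[:10])

# Comparing with the previous row requires date order within each customer
df = df.sort_values(["Customer_ID", "date"])

# Previous row's date within the same customer (NaT on each customer's first row)
prev_date = df.groupby("Customer_ID")["date"].shift()

# 1 where a new date starts, 0 where the date repeats; NaT compares as unequal
is_new_date = (df["date"] != prev_date).astype(int)

# Running total of "new date" flags per customer gives the transaction number
df["trans_nr"] = is_new_date.groupby(df["Customer_ID"]).cumsum()

print(df["trans_nr"].tolist())  # [1, 1, 2, 2, 3, 1, 1]
```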

Source

2017-10-17 15:02:56

Using ngroups might be simpler. – Dark

Thanks, I'm testing it now, but it seems to work – jeangelj

@jeangelj You're welcome. Happy coding! –

One approach is to drop the duplicate values before taking the cumulative count:

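# Number each distinct (Customer_ID, date) pair once per customer
trans_nr = (
    df.drop_duplicates(subset=['Customer_ID', 'date'])
      .set_index(['Customer_ID', 'date'])
      .groupby(level='Customer_ID')
      .cumcount() + 1
)
# Index df the same way so the assignment aligns on (Customer_ID, date)
df.set_index(['Customer_ID', 'date'], inplace=True)
df['trans_nr'] = trans_nr
df.reset_index(inplace=True)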

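To get the transaction number, you first drop the rows with duplicate Customer_ID and date values. You then set Customer_ID and date as the index (for the merge later) and perform the groupby and cumcount. This produces a Series whose values are the cumulative count for each Customer_ID and date combination.

You also set the same index on the original DataFrame (again, to allow the merge). Then you simply assign the trans_nr Series to a column in df. The index takes care of the merge logic.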
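If the index-based alignment feels too implicit, the same steps can be sketched with an explicit `merge`. This is a variation on the approach above, not the answer's own code. The names `numbers` and `result` are invented for illustration, and the explicit sort is an added safeguard for input that is not already in date order.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Customer_ID": ["ABC"] * 5 + ["BCD"] * 2,
    "Transaction_ID": [
        "2017-04-12-333", "2017-04-12-333", "2017-05-22-658",
        "2017-05-22-757", "2017-07-13-864", "2017-08-11-879",
        "2017-08-11-879",
    ],
    "Item_ID": ["X8973", "X2468", "X2906", "X8790", "X8790", "X2346", "X2468"],
})
df["date"] = pd.to_datetime(df["Transaction_ID"].str[:10])

# One row per distinct (customer, date) pair, in date order
numbers = (
    df[["Customer_ID", "date"]]
    .drop_duplicates()
    .sort_values(["Customer_ID", "date"])
)

# Number the distinct dates within each customer, starting at 1
numbers["trans_nr"] = numbers.groupby("Customer_ID").cumcount() + 1

# Attach the number to every original row with the same customer and date
result = df.merge(numbers, on=["Customer_ID", "date"], how="left")

print(result["trans_nr"].tolist())  # [1, 1, 2, 2, 3, 1, 1]
```

The left merge keeps the original rows in their original order, so `result` has the same seven rows as `df` plus the new column.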

Source

2017-10-17 15:02:17 ASGM

Good point, I must have missed that. – ASGM

Use a double groupby with ngroup(), i.e.

df['trans_nr'] = df.groupby('Customer_ID').apply(
    lambda x: x.groupby([x['date'].dt.date]).ngroup() + 1).values

 
Customer_ID Transaction_ID Item_ID  date trans_nr 
0   ABC 2017-04-12-333 X8973 2017-04-12   1 
1   ABC 2017-04-12-333 X2468 2017-04-12   1 
2   ABC 2017-05-22-658 X2906 2017-05-22   2 
3   ABC 2017-05-22-757 X8790 2017-05-22   2 
4   ABC 2017-07-13-864 X8790 2017-07-13   3 
5   BCD 2017-08-11-879 X2346 2017-08-11   1 
6   BCD 2017-08-11-879 X2468 2017-08-11   1

Source

2017-10-17 15:09:50 Dark

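Your solution uses two groupbys instead of one. I think it will be about twice as slow as mine. Can you condense it into a single groupby using ngroup? –

I agree. I used it twice because the count is supposed to reset for each customer. Let me try – Dark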
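On the single-groupby question, one option that might fit is a dense rank of the dates within each customer. This swaps in `rank(method="dense")` in place of `ngroup()`, so it is a different technique from the one in the answer, and no timing comparison was made here. Equal dates get equal ranks and the ranks have no gaps, which is the numbering the question asks for. It also does not depend on the row order.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Customer_ID": ["ABC"] * 5 + ["BCD"] * 2,
    "Transaction_ID": [
        "2017-04-12-333", "2017-04-12-333", "2017-05-22-658",
        "2017-05-22-757", "2017-07-13-864", "2017-08-11-879",
        "2017-08-11-879",
    ],
    "Item_ID": ["X8973", "X2468", "X2906", "X8790", "X8790", "X2346", "X2468"],
})
df["date"] = pd.to_datetime(df["Transaction_ID"].str[:10])

# Dense rank: ties share a rank and the next distinct date gets rank + 1.
# rank() returns floats, so cast to int for a clean counter column.
df["trans_nr"] = (
    df.groupby("Customer_ID")["date"].rank(method="dense").astype(int)
)

print(df["trans_nr"].tolist())  # [1, 1, 2, 2, 3, 1, 1]
```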
