Counting consecutive date observations within a groupby object

Here is a sample of my working DataFrame:

import pandas as pd

d = {
'item_number':['bdsm1000', 'bdsm1000', 'bdsm1000', 'ZZRWB18','ZZRWB18', 'ZZRWB18', 'ZZRWB18', 'ZZHP1427BLK', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1414', 'ZZHP1414', 'ZZHP1414', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE'], 
'Comp_ID':[2454, 2454, 2454, 1395, 1395, 1395, 1395, 3378, 1266941, 660867, 43978, 1266941, 660867, 43978, 1266941, 660867, 43978, 1266941, 660867, 43978, 43978, 43978, 43978, 1197347907, 70745, 4737, 1197347907, 4737, 1197347907, 70745, 4737, 1197347907, 70745, 4737, 1197347907, 4737, 1197487704, 1197347907, 70745, 23872, 4737, 1197347907, 4737, 1197487704, 1197347907, 23872, 4737, 1197487704, 1197347907, 70745], 
'date':['2016-11-22', '2016-11-20', '2016-11-19', '2016-11-22', '2016-11-20', '2016-11-19', '2016-11-18', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19', '2016-11-18', '2016-11-18', '2016-11-18', '2016-11-22', '2016-11-20', '2016-11-19', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-21', '2016-11-21', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19', '2016-11-18', '2016-11-18', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-21', '2016-11-21', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19']} 

df = pd.DataFrame(data=d) 
df.date = pd.to_datetime(df.date)

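I want to count consecutive observations starting from 2016-11-22, grouped by Comp_ID and item_number.

Essentially, what I'm looking to do is count how many consecutive days each Comp_ID and item_number pair has an observation, counting back from today's date. (This example was put together on November 22.) Consecutive runs observed days or weeks before today are not relevant. Only sequences like today... yesterday... the day before yesterday... and so on are relevant.

I got this to work on a smaller sample, but it seems to be getting tripped up on a larger dataset.

Here is the code for the smaller sample. I need to find the consecutive dates across thousands of sellers/items. For some reason, the code below does not work on the larger dataset.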

d = {'item_number':['KIN005','KIN005','KIN005','KIN005','KIN005','A789B','A789B','A789B','G123H','G123H','G123H'], 
'Comp_ID':['1395','1395','1395','1395','1395','7787','7787','7787','1395','1395','1395'], 
'date':['2016-11-22','2016-11-21','2016-11-20','2016-11-14','2016-11-13','2016-11-22','2016-11-21','2016-11-12','2016-11-22','2016-11-21','2016-11-08']} 

df = pd.DataFrame(data=d) 
df.date = pd.to_datetime(df.date) 
d = pd.Timedelta(1, 'D') 

df = df.sort_values(['item_number','date','Comp_ID'],ascending=False) 

g = df.groupby(['Comp_ID','item_number']) 
sequence = g['date'].apply(lambda x: x.diff().fillna(0).abs().le(d)).reset_index() 
sequence.set_index('index',inplace=True) 
test = df.join(sequence) 
test.columns = ['Comp_ID','date','item_number','consecutive'] 
g = test.groupby(['Comp_ID','item_number']) 
g['consecutive'].apply(lambda x: x.idxmin() - x.idxmax())

This gives the desired results for the smaller dataset:

Comp_ID item_number 
1395  G123H   2 
     KIN005   3 
7787  KIN005   2 
Name: consecutive, dtype: int64

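Source

2016-11-25 Yale Newman

Who changed the first SKU to bdsm1000? Good for a laugh –

First, I would suggest generating a sequence of dates, each one day earlier than the one before...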

import datetime 
import pandas as pd 

def gen_prior_date(start_date): 
    # yield start_date first, then step back one day at a time, forever
    yield start_date 
    while True: 
        start_date -= datetime.timedelta(days=1) 
        yield start_date

...

>>> start_date = datetime.date(2016, 11, 22) 
>>> back_in_time = gen_prior_date(start_date) 
>>> next(back_in_time) 
datetime.date(2016, 11, 22) 
>>> next(back_in_time) 
datetime.date(2016, 11, 21)

Now we need a function that we can apply to each group...

def count_consec_dates(dates, start_date): 
    dates = pd.to_datetime(dates.values).date 
    dates_set = set(dates) # O(1) vs O(n) lookup times 
    back_in_time = gen_prior_date(start_date) 

    tally = 0 
    while next(back_in_time) in dates_set: # jump out on first miss 
        tally += 1 
    return tally

The rest is easy...

>>> small_data = {'item_number': ['KIN005','KIN005','KIN005','KIN005','KIN005','A789B','A789B','A789B','G123H','G123H','G123H'], 
...    'Comp_ID': ['1395','1395','1395','1395','1395','7787','7787','7787','1395','1395','1395'], 
...    'date': ['2016-11-22','2016-11-21','2016-11-20','2016-11-14','2016-11-13','2016-11-22','2016-11-21','2016-11-12','2016-11-22','2016-11-21','2016-11-08']} 
>>> small_df = pd.DataFrame(data=small_data) 
>>> start_date = datetime.date(2016, 11, 22) 
>>> groups = small_df.groupby(['Comp_ID', 'item_number']).date 
>>> groups.apply(lambda x: count_consec_dates(x, start_date)) 
Comp_ID item_number 
1395  G123H   2 
     KIN005   3 
7787  A789B   2
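The same approach can be tried on rows from the question's larger sample, where Comp_ID is an integer and date is already a datetime column. The following is a minimal sketch, assuming only two of the question's groups (`big_sample` is a hand-picked subset, not the full frame); the two helper functions are repeated so the block runs on its own.

```python
import datetime
import pandas as pd

def gen_prior_date(start_date):
    """Yield start_date, then each earlier day in turn."""
    while True:
        yield start_date
        start_date -= datetime.timedelta(days=1)

def count_consec_dates(dates, start_date):
    """Count consecutive days, walking back from start_date, that appear in dates."""
    dates_set = set(pd.to_datetime(dates.values).date)
    tally = 0
    for day in gen_prior_date(start_date):
        if day not in dates_set:
            break  # first missing day ends the streak
        tally += 1
    return tally

# Assumption: a subset of the question's larger sample (two groups only).
big_sample = {
    'item_number': ['bdsm1000'] * 3 + ['WRM115WNTR'] * 5,
    'Comp_ID': [2454] * 3 + [4737] * 5,
    'date': ['2016-11-22', '2016-11-20', '2016-11-19',
             '2016-11-22', '2016-11-21', '2016-11-20', '2016-11-19', '2016-11-18'],
}
big_df = pd.DataFrame(big_sample)
big_df['date'] = pd.to_datetime(big_df['date'])

start_date = datetime.date(2016, 11, 22)
result = (big_df.groupby(['Comp_ID', 'item_number'])['date']
                .apply(lambda x: count_consec_dates(x, start_date)))
print(result)
```

Here bdsm1000 has no observation on 2016-11-21, so its streak stops after one day, while WRM115WNTR for Comp_ID 4737 is observed on all five days.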

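Source

2016-11-25 22:13:44

I was able to get it working on your dataset, but the problem concerns the larger dataset. The actual dataset I'm working with has thousands of sellers and items across a month of dates. –

Does it not work for that data? I don't see why it wouldn't. –

Is space or runtime the problem? I did leave room for numpy in this answer, but I tried to avoid expensive operations like joins or sorts. –

You can do it like this: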

today = pd.to_datetime('2016-11-22') 

# sort the DataFrame by `date` (descending) 
x = df.sort_values('date', ascending=False) 
g = x.groupby(['Comp_ID','item_number']) 
# keep rows whose distance in days from `today` equals their position within the group 
x[(today - x['date']).dt.days == g.cumcount()].groupby(['Comp_ID','item_number']).size()

Result:

Comp_ID item_number 
1395  G123H   2 
     KIN005   3 
7787  A789B   2 
dtype: int64

PS: thanks to @DataSwede for the faster diff calculation!

Explanation:

In [124]: x[(today - x['date']).dt.days == g.cumcount()] \ 
      .sort_values(['Comp_ID','item_number','date'], ascending=[1,1,0]) 
Out[124]: 
    Comp_ID  date item_number 
8 1395 2016-11-22  G123H 
9 1395 2016-11-21  G123H 
0 1395 2016-11-22  KIN005 
1 1395 2016-11-21  KIN005 
2 1395 2016-11-20  KIN005 
5 7787 2016-11-22  A789B 
6 7787 2016-11-21  A789B
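The comparison can be unpacked one column at a time. The sketch below is an illustration rather than part of the original answer: the column names `days_ago`, `position`, and `in_streak` are made up for the example, and it assumes each group has at most one row per date and no dates after `today`.

```python
import pandas as pd

small_data = {
    'item_number': ['KIN005'] * 5 + ['A789B'] * 3 + ['G123H'] * 3,
    'Comp_ID': ['1395'] * 5 + ['7787'] * 3 + ['1395'] * 3,
    'date': ['2016-11-22', '2016-11-21', '2016-11-20', '2016-11-14', '2016-11-13',
             '2016-11-22', '2016-11-21', '2016-11-12',
             '2016-11-22', '2016-11-21', '2016-11-08'],
}
df = pd.DataFrame(small_data)
df['date'] = pd.to_datetime(df['date'])
today = pd.to_datetime('2016-11-22')

keys = ['Comp_ID', 'item_number']
x = df.sort_values('date', ascending=False)

# Step 1: how many days before `today` each row was observed (0 means today).
x['days_ago'] = (today - x['date']).dt.days
# Step 2: position of the row inside its group, newest first (0, 1, 2, ...).
x['position'] = x.groupby(keys).cumcount()
# Step 3: in an unbroken run ending today the two agree (0 == 0, 1 == 1, ...).
# After the first gap, days_ago is larger than position for every later row.
x['in_streak'] = x['days_ago'] == x['position']

print(x.sort_values(keys + ['date'], ascending=[True, True, False]))

# Step 4: the streak length is the number of matching rows per group.
streaks = x[x['in_streak']].groupby(keys).size()
print(streaks)
```

For KIN005 the `days_ago` values are 0, 1, 2, 8, 9 against positions 0, 1, 2, 3, 4, so only the first three rows match. One consequence of filtering before counting is that a group with no row dated `today` has no matching rows and is absent from the result instead of appearing with a count of 0.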

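Source

2016-11-27 20:09:43 MaxU

On the line that calculates the date difference, was there a reason you chose to use the apply method on the groupby? I got a slight performance gain and the same output by calculating the difference directly on the DataFrame, `x['diff'] = (today - x['date']).dt.days`, but wanted to see whether there is a reason why using apply is the better choice. – DataSwede

@DataSwede, yes, good catch, thank you! It works properly when calculating the difference between a single date (`today`) and a Series. At first I tried calculating the "diff" within each group and stuck with that slow variant... – MaxU

@MaxU I ran into some errors with the solution I had been using, switched to this one, and it works perfectly! I'm very much a beginner at this, so if you could add a more step-by-step explanation of what happens when you compare the number of days to that groupby object's count, it would be much appreciated! I want to really understand what is going on –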
