2017-07-18 94 views

I can't work out how to write this code in a more Pythonic and efficient way. I'm trying to group observations by customerid and, for each observation, count how many times that customer was declined in the past 1, 7, and 30 days. Essentially: counting values from the past x days within a group.

import pandas as pd

t = pd.DataFrame({'customerid': [1, 1, 1, 3, 3],
                  'leadid': [10, 11, 12, 13, 14],
                  'postdate': ["2017-01-25 10:55:25.727", "2017-02-02 10:55:25.727",
                               "2017-02-27 10:55:25.727", "2017-01-25 10:55:25.727",
                               "2017-01-25 11:55:25.727"],
                  'post_status': ['Declined', 'Declined', 'Declined', 'Declined', 'Declined']})
t['postdate'] = pd.to_datetime(t['postdate'])

Here is the output:

customerid leadid post_status postdate 
1 10 Declined 2017-01-25 10:55:25.727 
1 11 Declined 2017-02-02 10:55:25.727 
1 12 Declined 2017-02-27 10:55:25.727 
3 13 Declined 2017-01-25 10:55:25.727 
3 14 Declined 2017-01-25 11:55:25.727 

My current solution is very slow:

from datetime import timedelta

final = []
for customer in t['customerid'].unique():

    temp = t[(t['customerid'] == customer) & (t['post_status'] == 'Declined')].copy()

    for i, row in temp.iterrows():
        date = row['postdate']
        final.append({
            'leadid': row['leadid'],
            'decline_1': temp[(temp['postdate'] <= date) & (temp['postdate'] >= date - timedelta(days=1))].shape[0] - 1,
            'decline_7': temp[(temp['postdate'] <= date) & (temp['postdate'] >= date - timedelta(days=7))].shape[0] - 1,
            'decline_30': temp[(temp['postdate'] <= date) & (temp['postdate'] >= date - timedelta(days=30))].shape[0] - 1
        })

The expected output looks like this:

decline_1 decline_30 decline_7 leadid 
0 0 0 10 
0 1 0 11 
0 1 0 12 
0 0 0 13 
1 1 1 14 

I imagine I need some kind of double groupby where I iterate over each row within each group, but apart from this double for loop, which takes far too long to finish, I haven't been able to get anything working.

Any help would be appreciated.

Answer


You can try groupby with transform, and use the fact that the sum of a boolean array is the number of True values, so you don't need to build an extra DataFrame each time with something like temp[(temp['postdate'] <= date) & (temp['postdate'] >= date - timedelta(days=7))].shape[0] - 1.
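As a minimal standalone illustration of the boolean-sum trick (not from the original answer):

```python
import pandas as pd

s = pd.Series([1, 5, 10])
mask = s.between(2, 10)  # boolean Series: [False, True, True]
count = mask.sum()       # summing booleans counts the True values
print(count)             # -> 2
```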

def find_declinations(df, period):
    results = pd.Series(index=df.index, name=period)
    for index, date in df.items():
        time_range = df.between(date - period, date)
        results[index] = time_range.sum() - 1
    return results.fillna(0).astype(int)

and call it like this:

results = pd.DataFrame(index=t.index)
for days in [1, 7, 30]:
    results['decline%i' % days] = t.groupby('customerid')[['postdate']].transform(
        lambda x: find_declinations(x, pd.to_timedelta(days, 'd')))
results.index = t['leadid']

Result:

decline1 decline7 decline30 
leadid   
10 0 0 0 
11 0 0 1 
12 0 0 1 
13 0 0 0 
14 1 1 1 

A slightly different approach

This approach does one groupby per period. You could speed things up a bit by doing a single groupby and then computing all the periods for each group:

def find_declinations_df(df, periods=[1, 7, 30, 60]):
    results = pd.DataFrame(index=pd.DataFrame(df).index, columns=periods)
    for period in periods:
        for index, date in df['postdate'].items():
            time_range = df['postdate'].between(date - pd.to_timedelta(period, 'd'), date)
            results.loc[index, period] = time_range.sum() - 1
    return results.fillna(0).astype(int)

results = pd.concat(find_declinations_df(group[1]) for group in t.groupby('customerid')) 
results['leadid'] = t['leadid'] 

Result:

1 7 30 60 leadid 
0 0 0 0 0 10 
1 0 0 1 1 11 
2 0 0 1 2 12 
3 0 0 0 0 13 
4 1 1 1 1 14 
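Not part of the original answer, but worth noting: the same counts can be computed with pandas' time-based rolling windows (assuming a pandas version with offset-based rolling, 0.19+), which avoids the inner Python loop over rows. A sketch on the question's sample data:

```python
import pandas as pd

t = pd.DataFrame({'customerid': [1, 1, 1, 3, 3],
                  'leadid': [10, 11, 12, 13, 14],
                  'postdate': ["2017-01-25 10:55:25.727", "2017-02-02 10:55:25.727",
                               "2017-02-27 10:55:25.727", "2017-01-25 10:55:25.727",
                               "2017-01-25 11:55:25.727"],
                  'post_status': ['Declined'] * 5})
t['postdate'] = pd.to_datetime(t['postdate'])

parts = []
for customerid, group in t.groupby('customerid'):
    group = group.sort_values('postdate')
    # index by time so rolling('7D') means a trailing 7-day window
    by_time = group.set_index('postdate')['leadid']
    part = pd.DataFrame({'leadid': group['leadid'].to_numpy()})
    for days in [1, 7, 30]:
        # count() includes the current row itself, so subtract 1
        counts = by_time.rolling('%dD' % days).count() - 1
        part['decline_%d' % days] = counts.astype(int).to_numpy()
    parts.append(part)

out = pd.concat(parts, ignore_index=True)
print(out)
```

On this data it reproduces the expected counts from the question; note that rolling's trailing window is open on the left, whereas between above is inclusive on both ends, which can matter for rows exactly N days apart.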

You're an absolute genius! Thanks! – fcol