2015-07-22 136 views

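PANDAS: merging calculated data from grouped dataframes into the main dataframe

First time posting here, so apologies if I haven't asked the question quite correctly. I've spent many years manipulating data in Excel and PowerPivot, but my current project needs more heavy lifting. I've been looking at pandas and think it is up to the job, but I'm stuck.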

I am trying to calculate the number of days since the previous purchase for each customer.

My initial dataframe looks like this:

customer_id date  invoice_amt 
0 101A  21/03/2012 654.76  
1 101A  1/02/2012 234.45  
2 102A  23/01/2012 99.45  
3 104B  18/12/2011 767.63  
4 101A  9/12/2011 124.76  
5 104B  27/11/2011 346.87  
6 102A  18/11/2011 652.65  
7 104B  12/10/2011 765.21  
8 101A  1/10/2011 275.76  
9 102A  21/09/2011 532.21 

My target dataframe looks like this:

customer_id date  invoice_amt days_since 
0 101A  21/03/2012 654.76  49 
1 101A  1/02/2012 234.45  54 
2 102A  23/01/2012 99.45  66 
3 104B  18/12/2011 767.63  21 
4 101A  9/12/2011 124.76  69 
5 104B  27/11/2011 346.87  46 
6 102A  18/11/2011 652.65  58 
7 104B  12/10/2011 765.21  NaN 
8 101A  1/10/2011 275.76  NaN 
9 102A  21/09/2011 532.21  NaN 

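I've got to the point where I can calculate the days_since value within each grouped dataframe, but I don't know how to get the values back into the main dataframe (data_df).

Any help would be greatly appreciated... thanks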

import pandas as pd 
#import numpy as np 

#dataframe data note: no_days_since_last_purchase hard coded for testing purposes 
my_data = {'customer_id' : ['101A', '101A', '102A', '104B', '101A', '104B', '102A', '104B', '101A', '102A' ], 
      'date' : ['20120321','20120201','20120123','20111218','20111209','20111127','20111118','20111012','20111001','20110921'], 
      'invoice_amt' : [654.76, 234.45, 99.45, 767.63, 124.76, 346.87, 652.65, 765.21, 275.76, 532.21 ], 
      'no_days_since_last_purchase' : ['49', '54', '66', '21', '69', '46', '58', 'NaN', 'NaN', 'NaN']} 

data_df = pd.DataFrame(my_data).sort_index(by='date',ascending=True) 

#convert date str to date type 
data_df['date'] = pd.to_datetime(data_df['date'].astype(str),format='%Y%m%d') 

#group dataframe by customer_id 
grouped_data = data_df.groupby(['customer_id'])  

#for each row in each grouped dataframe calculate the difference in days between current and previous 
#if there is no previous then use 2000-01-01 then convert to integer 
for customer_id, group in grouped_data: 
    group['days_since'] = (group['date'] - group['date'].shift().fillna(pd.datetime(2000,1,1))).astype('timedelta64[D]') 
    print group 

OUTPUT:

customer_id  date invoice_amt no_days_since_last_purchase days_since 
8  101A 2011-10-01  275.76       NaN  4291 
4  101A 2011-12-09  124.76       69   69 
1  101A 2012-02-01  234.45       54   54 
0  101A 2012-03-21  654.76       49   49 
    customer_id  date invoice_amt no_days_since_last_purchase days_since 
9  102A 2011-09-21  532.21       NaN  4281 
6  102A 2011-11-18  652.65       58   58 
2  102A 2012-01-23  99.45       66   66 
    customer_id  date invoice_amt no_days_since_last_purchase days_since 
7  104B 2011-10-12  765.21       NaN  4302 
5  104B 2011-11-27  346.87       46   46 
3  104B 2011-12-18  767.63       21   21 

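Oh, and I get SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

Any ideas on how I should avoid this warning would also be much appreciated.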


Possible duplicate of "Setting values on a copy of a slice from a DataFrame" (http://stackoverflow.com/questions/31468176) – firelynx

Answers

0
df_container = [] 
for customer_id, group in grouped_data: 
    group['days_since'] = (group['date'] - group['date'].shift().fillna(pd.datetime(2000,1,1))).astype('timedelta64[D]') 
    df_container.append(group) 

data_df = pd.concat(df_container) 

Maybe this is what you want?

customer_id  date invoice_amt no_days_since_last_purchase days_since 
8  101A 2011-10-01  275.76       NaN  4291 
4  101A 2011-12-09  124.76       69   69 
1  101A 2012-02-01  234.45       54   54 
0  101A 2012-03-21  654.76       49   49 
9  102A 2011-09-21  532.21       NaN  4281 
6  102A 2011-11-18  652.65       58   58 
2  102A 2012-01-23  99.45       66   66 
7  104B 2011-10-12  765.21       NaN  4302 
5  104B 2011-11-27  346.87       46   46 
3  104B 2011-12-18  767.63       21   21 
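
The loop-and-concat idea above can be sketched for a recent pandas release as follows. This is only an illustration under stated assumptions: the sample rows are a shortened subset of the question's data, the explicit group.copy() is one way to avoid writing into a slice of data_df (the usual trigger for SettingWithCopyWarning), and the first purchase is left as NaN instead of being filled with the 2000-01-01 placeholder.

```python
import pandas as pd

# Shortened sample for illustration (a subset of the question's rows)
data_df = pd.DataFrame({
    "customer_id": ["101A", "101A", "102A", "101A", "102A"],
    "date": pd.to_datetime(
        ["20120321", "20120201", "20120123", "20111209", "20111118"],
        format="%Y%m%d",
    ),
}).sort_values("date")

pieces = []
for customer_id, group in data_df.groupby("customer_id"):
    # Work on an explicit copy so the new column is not set on a slice
    group = group.copy()
    # Gap to the same customer's previous purchase; NaN for the first one
    group["days_since"] = (group["date"] - group["date"].shift()).dt.days
    pieces.append(group)

# Stitch the groups back together and order rows by their original labels
result = pd.concat(pieces).sort_index()
print(result)
```

Because each piece keeps its original index labels, sort_index() is enough to return the combined frame to the row order of the initial dataframe.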
1

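Use transform to produce a Series whose index is aligned to the original df, so you can assign it as a new column. Additionally, you can't use astype to cast datetime64[ns] to timedelta[D], so you need an extra step to call to_timedelta: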

In [193]: 
data_df['days_since'] = data_df.groupby(['customer_id'])['date'].transform(lambda x: x - x.shift().fillna(pd.datetime(2000,1,1))) 
data_df['days_since'] = pd.to_timedelta(data_df['days_since']) 
data_df 

Out[193]: 
    customer_id  date invoice_amt no_days_since_last_purchase days_since 
9  102A 2011-09-21  532.21       NaN 4281 days 
8  101A 2011-10-01  275.76       NaN 4291 days 
7  104B 2011-10-12  765.21       NaN 4302 days 
6  102A 2011-11-18  652.65       58  58 days 
5  104B 2011-11-27  346.87       46  46 days 
4  101A 2011-12-09  124.76       69  69 days 
3  104B 2011-12-18  767.63       21  21 days 
2  102A 2012-01-23  99.45       66  66 days 
1  101A 2012-02-01  234.45       54  54 days 
0  101A 2012-03-21  654.76       49  49 days 
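
If plain integers are preferred over values such as "58 days", the same transform approach can be followed by the .dt.days accessor. The sketch below is a hedged adaptation, not the answer's exact code: it assumes a recent pandas release, where pd.datetime is no longer available, so pd.Timestamp stands in for it, and it uses a shortened subset of the question's rows.

```python
import pandas as pd

# Shortened sample for illustration (a subset of the question's rows)
data_df = pd.DataFrame({
    "customer_id": ["101A", "101A", "102A", "102A"],
    "date": pd.to_datetime(
        ["20120201", "20111001", "20111118", "20110921"], format="%Y%m%d"
    ),
}).sort_values("date")

placeholder = pd.Timestamp(2000, 1, 1)  # same stand-in date as the question

# transform returns one value per input row, labelled with the original index
delta = data_df.groupby("customer_id")["date"].transform(
    lambda x: x - x.shift().fillna(placeholder)
)

# to_timedelta mirrors the answer; .dt.days converts each gap to whole days
data_df["days_since"] = pd.to_timedelta(delta).dt.days
print(data_df.sort_index())
```

Since the assignment aligns on index labels, the order of the rows at the moment of assignment does not matter.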

EDIT

Actually, you can just call to_timedelta on the returned Series, like so:

data_df['days_since'] = pd.to_timedelta(data_df.groupby(['customer_id'])['date'].transform(lambda x: x - x.shift().fillna(pd.datetime(2000,1,1))))
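
As an alternative that reproduces the target dataframe from the question, including NaN for each customer's first purchase, a grouped diff() can replace the shift-and-fill step. This is a sketch under assumptions rather than part of the original answers: it assumes a recent pandas release and drops the 2000-01-01 placeholder entirely.

```python
import pandas as pd

# The question's data, without the hard-coded test column
data_df = pd.DataFrame({
    "customer_id": ["101A", "101A", "102A", "104B", "101A",
                    "104B", "102A", "104B", "101A", "102A"],
    "date": ["20120321", "20120201", "20120123", "20111218", "20111209",
             "20111127", "20111118", "20111012", "20111001", "20110921"],
    "invoice_amt": [654.76, 234.45, 99.45, 767.63, 124.76,
                    346.87, 652.65, 765.21, 275.76, 532.21],
})
data_df["date"] = pd.to_datetime(data_df["date"], format="%Y%m%d")

# Sort oldest-first so diff() compares each purchase with the previous one;
# index alignment then puts the values back on the right rows of data_df.
gaps = data_df.sort_values("date").groupby("customer_id")["date"].diff()
data_df["days_since"] = gaps.dt.days

print(data_df)
```

The resulting days_since column is a float column because of the NaN entries; whether to keep NaN or substitute a sentinel value depends on how the column is used downstream.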