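Pandas: merging calculated data from grouped dataframes into the main dataframe
First-time poster here, so apologies if I don't get this question exactly right. I have spent many years manipulating data in Excel and PowerPivot, but my current project requires more heavy-lifting capability. I have been looking at pandas and think it is up to the job, but I'm stuck.
I am trying to calculate the number of days since the previous purchase for each customer.
My initial dataframe looks like this: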
customer_id date invoice_amt
0 101A 21/03/2012 654.76
1 101A 1/02/2012 234.45
2 102A 23/01/2012 99.45
3 104B 18/12/2011 767.63
4 101A 9/12/2011 124.76
5 104B 27/11/2011 346.87
6 102A 18/11/2011 652.65
7 104B 12/10/2011 765.21
8 101A 1/10/2011 275.76
9 102A 21/09/2011 532.21
My target dataframe looks like this:
customer_id date invoice_amt days_since
0 101A 21/03/2012 654.76 49
1 101A 1/02/2012 234.45 54
2 102A 23/01/2012 99.45 66
3 104B 18/12/2011 767.63 21
4 101A 9/12/2011 124.76 69
5 104B 27/11/2011 346.87 46
6 102A 18/11/2011 652.65 58
7 104B 12/10/2011 765.21 NaN
8 101A 1/10/2011 275.76 NaN
9 102A 21/09/2011 532.21 NaN
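To make the target column concrete, here is a minimal check of the first row using only the standard library. It assumes `days_since` means the gap between a purchase and the same customer's previous purchase, which is how the table above reads; the variable names are just for illustration.

```python
from datetime import date

# Customer 101A: purchase on 21/03/2012, previous purchase on 1/02/2012.
current_purchase = date(2012, 3, 21)
previous_purchase = date(2012, 2, 1)

# Subtracting two dates gives a timedelta; .days is the whole-day gap.
days_since = (current_purchase - previous_purchase).days
print(days_since)  # 49, matching row 0 of the target table
```

A customer's earliest purchase has no previous purchase to compare against, which is why rows 7, 8 and 9 show NaN.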
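I have got to the point where I can calculate the days_since value within each grouped dataframe, but I don't know how to get those values back into the main dataframe (data_df).
Any help would be much appreciated... thanks.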
import pandas as pd
#import numpy as np
#dataframe data note: no_days_since_last_purchase hard coded for testing purposes
my_data = {'customer_id' : ['101A', '101A', '102A', '104B', '101A', '104B', '102A', '104B', '101A', '102A' ],
           'date' : ['20120321','20120201','20120123','20111218','20111209','20111127','20111118','20111012','20111001','20110921'],
           'invoice_amt' : [654.76, 234.45, 99.45, 767.63, 124.76, 346.87, 652.65, 765.21, 275.76, 532.21 ],
           'no_days_since_last_purchase' : ['49', '54', '66', '21', '69', '46', '58', 'NaN', 'NaN', 'NaN']}
data_df = pd.DataFrame(my_data).sort_values(by='date', ascending=True)
#convert date str to date type
data_df['date'] = pd.to_datetime(data_df['date'].astype(str),format='%Y%m%d')
#group dataframe by customer_id
grouped_data = data_df.groupby(['customer_id'])
#for each row in each grouped dataframe calculate the difference in days between current and previous
#if there is no previous then use 2000-01-01 then convert to integer
for customer_id, group in grouped_data:
    group['days_since'] = (group['date'] - group['date'].shift().fillna(pd.Timestamp(2000, 1, 1))).dt.days
    print(group)
OUTPUT:
customer_id date invoice_amt no_days_since_last_purchase days_since
8 101A 2011-10-01 275.76 NaN 4291
4 101A 2011-12-09 124.76 69 69
1 101A 2012-02-01 234.45 54 54
0 101A 2012-03-21 654.76 49 49
customer_id date invoice_amt no_days_since_last_purchase days_since
9 102A 2011-09-21 532.21 NaN 4281
6 102A 2011-11-18 652.65 58 58
2 102A 2012-01-23 99.45 66 66
customer_id date invoice_amt no_days_since_last_purchase days_since
7 104B 2011-10-12 765.21 NaN 4302
5 104B 2011-11-27 346.87 46 46
3 104B 2011-12-18 767.63 21 21
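One possible way to get the per-group frames printed above back into a single dataframe is to collect them in a list and concatenate them. This is only a sketch: `pieces` and `result_df` are names made up for the example, and it leaves the first purchase of each customer as NaN (as in the target table) instead of filling with 2000-01-01.

```python
import pandas as pd

# Same sample data as in the question, without the hard-coded test column.
data_df = pd.DataFrame({
    'customer_id': ['101A', '101A', '102A', '104B', '101A',
                    '104B', '102A', '104B', '101A', '102A'],
    'date': pd.to_datetime(
        ['20120321', '20120201', '20120123', '20111218', '20111209',
         '20111127', '20111118', '20111012', '20111001', '20110921'],
        format='%Y%m%d'),
    'invoice_amt': [654.76, 234.45, 99.45, 767.63, 124.76,
                    346.87, 652.65, 765.21, 275.76, 532.21],
}).sort_values('date')

pieces = []
for customer_id, group in data_df.groupby('customer_id'):
    # Work on an explicit copy so the assignment below does not
    # write into a slice of data_df.
    group = group.copy()
    # Gap in days to the previous row of the same customer (NaN for the first).
    group['days_since'] = group['date'].diff().dt.days
    pieces.append(group)

# Stack the per-customer frames and restore the original row order.
result_df = pd.concat(pieces).sort_index()
print(result_df)
```

Because each group keeps its original index labels, `sort_index()` is enough to line the rows up with the initial dataframe again.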
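Oh, and I also get a SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
Any thoughts on how I should avoid this warning would also be much appreciated.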
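A sketch of one way to sidestep both the loop and the warning, assuming NaN is acceptable for each customer's first purchase (as in the target table): compute the difference per group directly on the main dataframe. Since nothing is assigned to a per-group slice, there is no copy for pandas to warn about.

```python
import pandas as pd

data_df = pd.DataFrame({
    'customer_id': ['101A', '101A', '102A', '104B', '101A',
                    '104B', '102A', '104B', '101A', '102A'],
    'date': ['20120321', '20120201', '20120123', '20111218', '20111209',
             '20111127', '20111118', '20111012', '20111001', '20110921'],
    'invoice_amt': [654.76, 234.45, 99.45, 767.63, 124.76,
                    346.87, 652.65, 765.21, 275.76, 532.21],
})

# Convert the date strings, then sort so that diff() compares each
# purchase with the chronologically previous one.
data_df['date'] = pd.to_datetime(data_df['date'], format='%Y%m%d')
data_df = data_df.sort_values('date')

# The grouped diff is aligned on data_df's index, so it can be
# assigned straight back as a new column.
data_df['days_since'] = data_df.groupby('customer_id')['date'].diff().dt.days

print(data_df.sort_index())
```

If the 2000-01-01 placeholder from the original loop is still wanted, the NaN values can be filled afterwards; that step is left out here.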
Possible duplicate of [Setting values on a copy of a slice from a dataframe](http://stackoverflow.com/questions/31468176/setting-values-on-a-copy-of-a-slice-from-a-dataframe) – firelynx