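pandas groupby + resample/TimeGrouper: change over months from start

I have a DataFrame of employee salary data (sample below), where Date is the date on which the employee's salary took effect: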

Employee Date  Salary 
PersonA  1/1/2016 $50000 
PersonB  3/5/2014 $65000 
PersonB  3/1/2015 $75000 
PersonB  3/1/2016 $100000 
PersonC  5/15/2010 $75000 
PersonC  6/3/2011 $100000 
PersonC  3/10/2012 $110000 
PersonC  9/5/2012 $130000 
PersonC  3/1/2013 $150000 
PersonC  3/1/2014 $200000

In this example, PersonA started this year at $50,000, while PersonC has been with the company for a while and has received several raises since starting on 5/15/2010.

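I need to convert the Date column to Months From Start, on a per-employee basis, where Months From Start is in increments of m months (m is specified by me). For example, for PersonB, assuming m=12, the result would be: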

Employee Months From Start Salary 
PersonB  0     $65000 
PersonB  12     $65000 
PersonB  24     $75000

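This means that at month 0 (the start of employment), PersonB had a salary of $65,000; 12 months later his salary was $65,000, and 24 months later his salary was $75,000. Note that the next increment (36 months) would NOT appear in PersonB's transformed DataFrame, because that duration exceeds PersonB's length of employment (it would be in the future).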

Again, note that I want to be able to set m to any month increment. If I wanted 6-month increments (m=6), the result would be:

Employee Months From Start Salary 
PersonB  0     $65000 
PersonB  6     $65000 
PersonB  12     $65000 
PersonB  18     $75000 
PersonB  24     $100000 
PersonB  30     $100000

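As a final step, I would also like to include the employee's salary as of today in the transformed DataFrame. Again using PersonB and assuming m=6, the result would be: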

Employee Months From Start Salary 
PersonB  0     $65000 
PersonB  6     $65000 
PersonB  12     $65000 
PersonB  18     $75000 
PersonB  24     $100000 
PersonB  30     $100000 
PersonB  32.92    $100000 <--added (today is 32.92 months from start)

Question: Is there a programmatic way (I assume using at least one of groupby, resample, or TimeGrouper) to produce the desired DataFrame above?

Note: You can assume all employees are active (they have not left the company).

Source

2016-12-01 NickBraunagel

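Thank you very much for the answers provided. Unfortunately, all of them are a little "off" and do not quite reach the goal. I ended up nesting two for loops in a list comprehension to get there.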
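The final code is not shown here. Purely as a sketch of what a list comprehension with two nested for loops could look like, the following assumes a fixed "today" of 2016-12-01, steps in whole calendar months via pd.DateOffset, and uses a hypothetical helper whole_months; none of these come from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['PersonB', 'PersonB', 'PersonB'],
    'Date': pd.to_datetime(['2014-03-05', '2015-03-01', '2016-03-01']),
    'Salary': [65000, 75000, 100000],
})
m = 6
as_of = pd.Timestamp('2016-12-01')  # assumed "today"

def whole_months(start, end):
    """Hypothetical helper: completed calendar months between two dates."""
    return (end.year - start.year) * 12 + (end.month - start.month) - (end.day < start.day)

# Outer loop: employees; inner loop: m-month steps from that employee's start.
# The salary at each step is the last record dated on or before that step.
rows = [
    (emp, months,
     grp.loc[grp['Date'] <= grp['Date'].iloc[0] + pd.DateOffset(months=months), 'Salary'].iloc[-1])
    for emp, grp in df.sort_values('Date').groupby('Employee')
    for months in range(0, whole_months(grp['Date'].iloc[0], as_of) + 1, m)
]
result = pd.DataFrame(rows, columns=['Employee', 'Months From Start', 'Salary'])
print(result)
```

Because this sketch steps in exact calendar months, PersonB's month-12 point is 2015-03-05, which falls after the 3/1/2015 raise, so it reports $75000 there rather than the $65000 shown in the sample table.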

Source

2016-12-03 20:07:08 NickBraunagel

You can do this with DataFrames:

>>> import pandas as pd 
>>> df = pd.DataFrame([['PersonC','5/15/2010',75000],['PersonC','7/3/2011',100000],['PersonB','3/5/2014',65000],['PersonB','3/1/2015',75000],['PersonB','3/1/2016',100000]],columns=['Employee','Date','Salary']) 
>>> df['Date']= pd.to_datetime(df['Date']) 
>>> df 
    Employee  Date Salary 
0 PersonC 2010-05-15 75000 
1 PersonC 2011-07-03 100000 
2 PersonB 2014-03-05 65000 
3 PersonB 2015-03-01 75000 
4 PersonB 2016-03-01 100000 
>>> # Earliest effective date per employee is the employment start date
>>> start_date = df.groupby('Employee')['Date'].min().rename('Start Date').reset_index()
>>> df = df.merge(start_date, how='left', on='Employee')
>>> # Days elapsed since the start date
>>> df['Months From Start'] = (df['Date'] - df['Start Date']).dt.days
>>> # Approximate months as days // 30, then round down to a multiple of 6
>>> df['Months From Start'] = df['Months From Start'].apply(lambda x: (x // 30) - (x // 30) % 6)
>>> df 
    Employee  Date Salary Start Date Months From Start 
0 PersonC 2010-05-15 75000 2010-05-15     0 
1 PersonC 2011-07-03 100000 2010-05-15     12 
2 PersonB 2014-03-05 65000 2014-03-05     0 
3 PersonB 2015-03-01 75000 2014-03-05     12 
4 PersonB 2016-03-01 100000 2014-03-05     24

This uses the groupby and merge functions. You can replace the 6 with a variable called m and assign any value to it.
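As a minimal sketch of that substitution, the same steps can be wrapped in a function that takes m as a parameter. The function name add_months_from_start is hypothetical, it uses transform('min') in place of the merge, and it keeps the 30-day month approximation.

```python
import pandas as pd

def add_months_from_start(df, m):
    """Hypothetical wrapper: bucket each record into m-month steps from the start."""
    out = df.copy()
    # Earliest effective date per employee, broadcast back onto every row
    start = out.groupby('Employee')['Date'].transform('min')
    approx_months = (out['Date'] - start).dt.days // 30  # 30-day month approximation
    out['Months From Start'] = approx_months // m * m    # round down to a multiple of m
    return out

df = pd.DataFrame({
    'Employee': ['PersonB', 'PersonB', 'PersonB'],
    'Date': pd.to_datetime(['2014-03-05', '2015-03-01', '2016-03-01']),
    'Salary': [65000, 75000, 100000],
})

print(add_months_from_start(df, 6))
print(add_months_from_start(df, 5))
```

Note that this only labels the rows that already exist; it does not create rows for increments in which no salary change happened.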

Source

2016-12-01 20:07:34 Ali

OK, so for the first part I would do something like this...

import numpy as np 
import pandas as pd 

df = pd.DataFrame({ 
    'Employee': ['PersonA', 'PersonB', 'PersonB', 'PersonB', 'PersonC', 'PersonC', 'PersonC', 'PersonC', 'PersonC', 'PersonC'], 
    'Date': ['1/1/2016', '3/5/2014', '3/1/2015', '3/1/2016', '5/15/2010', '6/3/2011', '3/10/2012', '9/5/2012', '3/1/2013', '3/1/2014'], 
    'Salary': [50000 , 65000 , 75000 , 100000 , 75000 , 100000 , 110000 , 130000 , 150000 , 200000] 
}) 

df.Date = pd.to_datetime(df.Date) 

m = 6 
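emp_groups = df.groupby('Employee')
# Time elapsed since each employee's earliest (start) date
df['months_from_start'] = df.Date - emp_groups.Date.transform('min')
# Approximate a month as 30 days, then floor to a multiple of m
df.months_from_start = df.months_from_start.dt.days / 30 // m * m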

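m can be whatever you want. I compute the difference between each date and the employee's minimum date, divide by the approximate number of days in a month, and then do a bit of integer division to "round" down to the desired window size.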

This gives you something like this...

 Date Employee Salary months_from_start 
0 2016-01-01 PersonA 50000     0 
1 2014-03-05 PersonB 65000     0 
2 2015-03-01 PersonB 75000     12 
3 2016-03-01 PersonB 100000     24 
4 2010-05-15 PersonC 75000     0 
5 2011-06-03 PersonC 100000     12 
6 2012-03-10 PersonC 110000     18 
7 2012-09-05 PersonC 130000     24 
8 2013-03-01 PersonC 150000     30 
9 2014-03-01 PersonC 200000     42
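To make the rounding step concrete, here is a small worked sketch of the same arithmetic for PersonC's rows using only the standard library. The dates come from the sample data; the variable names are illustrative.

```python
from datetime import date

m = 6
start = date(2010, 5, 15)  # PersonC's first record
records = [date(2011, 6, 3), date(2012, 3, 10), date(2012, 9, 5),
           date(2013, 3, 1), date(2014, 3, 1)]

buckets = []
for d in records:
    days = (d - start).days          # elapsed days since the start date
    approx_months = days / 30        # 30-day month approximation
    bucket = approx_months // m * m  # floor to a multiple of m
    buckets.append(bucket)
    print(d, days, round(approx_months, 2), bucket)
```

The buckets come out as 12, 18, 24, 30 and 42. There is no row for 36 because the rounding only labels existing records; it does not add rows for increments in which no salary change occurred.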

The second part is a bit trickier. I would create a new DataFrame and concat it to the first...

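# Last salary record per employee (one row per employee)
last_date_df = emp_groups.last()
# Unrounded months between each employee's first and last record
last_date_df['months_from_start'] = (last_date_df.Date - emp_groups.first().Date).dt.days / 30
last_date_df.reset_index(inplace=True)

# Append the extra rows to the original frame
pd.concat([df, last_date_df], axis=0)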

Which gets you...

 Date Employee Salary months_from_start 
0 2016-01-01 PersonA 50000   0.000000 
1 2014-03-05 PersonB 65000   0.000000 
2 2015-03-01 PersonB 75000   12.000000 
3 2016-03-01 PersonB 100000   24.000000 
4 2010-05-15 PersonC 75000   0.000000 
5 2011-06-03 PersonC 100000   12.000000 
6 2012-03-10 PersonC 110000   18.000000 
7 2012-09-05 PersonC 130000   24.000000 
8 2013-03-01 PersonC 150000   30.000000 
9 2014-03-01 PersonC 200000   42.000000 
0 2016-01-01 PersonA 50000   0.000000 
1 2016-03-01 PersonB 100000   24.233333 
2 2014-03-01 PersonC 200000   46.200000
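The appended rows above measure up to each employee's last salary record, whereas the question asked for a row as of today. A minimal sketch of that variant follows, assuming a fixed as_of date of 2016-12-01 and an average month of 365.25/12 days; both are assumptions, chosen because together they reproduce the 32.92 figure given for PersonB in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['PersonB', 'PersonB', 'PersonB'],
    'Date': pd.to_datetime(['2014-03-05', '2015-03-01', '2016-03-01']),
    'Salary': [65000, 75000, 100000],
})

as_of = pd.Timestamp('2016-12-01')  # assumed "today"
days_per_month = 365.25 / 12        # assumed average month length

# One row per employee: first effective date and most recent salary
grouped = df.sort_values('Date').groupby('Employee')
today_rows = grouped.agg(start=('Date', 'first'), Salary=('Salary', 'last')).reset_index()

# Unrounded months from start to the as_of date
elapsed_days = (as_of - today_rows['start']).dt.days
today_rows['months_from_start'] = (elapsed_days / days_per_month).round(2)
today_rows = today_rows[['Employee', 'Salary', 'months_from_start']]
print(today_rows)
```

These rows can then be appended with pd.concat in the same way as last_date_df.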

Source

2016-12-01 20:10:13

You can do this by combining groupby and resample. To use resample, you need the dates as the index.

df.index = pd.to_datetime(df.Date) 
df.drop('Date',axis = 1, inplace = True)

Then:

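# ffill() replaces pad(), which was removed in pandas 2.0; pandas 2.2+ spells the month-end alias '6ME'
df.groupby('Employee').resample('6M').ffill()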

In this case I used a 6-month period. Note that it anchors on the last day of each month; I hope that won't be a problem. You will then have:

Employee Date  Salary 
0 PersonA 2016-01-31 $50000 
1 PersonB 2014-03-31 $65000 
2 PersonB 2014-09-30 $65000 
3 PersonB 2015-03-31 $75000 
4 PersonB 2015-09-30 $75000 
5 PersonB 2016-03-31 $100000 
6 PersonC 2010-05-31 $75000 
7 PersonC 2010-11-30 $75000 
8 PersonC 2011-05-31 $75000 
9 PersonC 2011-11-30 $100000 
10 PersonC 2012-05-31 $110000 
11 PersonC 2012-11-30 $130000 
12 PersonC 2013-05-31 $150000 
13 PersonC 2013-11-30 $150000 
14 PersonC 2014-05-31 $200000

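Now you can create the "Months since started" column (the cumcount function gives the position of each row within its group). Remember to multiply by the number of months in each period (6 in this case):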

df['Months since started'] = df.groupby('Employee').cumcount()*6 

    Employee Date  Salary  Months since started 
0 PersonA 2016-01-31 $50000     0 
1 PersonB 2014-03-31 $65000     0 
2 PersonB 2014-09-30 $65000     6 
3 PersonB 2015-03-31 $75000     12 
4 PersonB 2015-09-30 $75000     18 
5 PersonB 2016-03-31 $100000     24 
6 PersonC 2010-05-31 $75000     0 
7 PersonC 2010-11-30 $75000     6 
8 PersonC 2011-05-31 $75000     12 
9 PersonC 2011-11-30 $100000     18 
10 PersonC 2012-05-31 $110000     24 
11 PersonC 2012-11-30 $130000     30 
12 PersonC 2013-05-31 $150000     36 
13 PersonC 2013-11-30 $150000     42 
14 PersonC 2014-05-31 $200000     48

Hope it helps!
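If the month-end anchoring is a problem, one alternative (a different technique from the resample call above, not part of this answer) is to build the m-month grid from each employee's own start date and look up the salary in effect with pd.merge_asof. This is only a sketch: the as_of cut-off date and the variable names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['PersonA', 'PersonB', 'PersonB', 'PersonB'],
    'Date': pd.to_datetime(['2016-01-01', '2014-03-05', '2015-03-01', '2016-03-01']),
    'Salary': [50000, 65000, 75000, 100000],
})
m = 6
as_of = pd.Timestamp('2016-12-01')  # assumed cut-off ("today")

# Build one grid row per employee per m-month step, anchored on the start date
grid_rows = []
for employee, start in df.groupby('Employee')['Date'].min().items():
    k = 0
    while start + pd.DateOffset(months=k * m) <= as_of:
        grid_rows.append({'Employee': employee,
                          'Months From Start': k * m,
                          'Date': start + pd.DateOffset(months=k * m)})
        k += 1
grid = pd.DataFrame(grid_rows)

# For each grid date, take the most recent salary record on or before it
result = pd.merge_asof(grid.sort_values('Date'), df.sort_values('Date'),
                       on='Date', by='Employee')
result = result.sort_values(['Employee', 'Months From Start']).reset_index(drop=True)
print(result[['Employee', 'Months From Start', 'Salary']])
```

merge_asof requires both frames to be sorted on the key, hence the sort_values calls. The grid dates here are exact calendar dates such as 2014-09-05, so the result can differ from the month-end table above wherever a raise falls between the two.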

Source

2016-12-01 21:11:21

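Thanks for the tip. One problem I ran into is that, in the full dataset, some employees have more than one salary taking effect on the same day, so setting 'index' to 'df.Date' violates the unique-index requirement that 'resample' apparently has (I get this error: 'ValueError: cannot reindex from a duplicate axis'). If you have any ideas, please let me know. – NickBraunagel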
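One possible workaround is to drop the duplicate (Employee, Date) pairs before setting the index. The sketch below assumes that when several salaries share an effective date, the last one listed should win; the comment does not say which record is correct, and the duplicated sample row is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['PersonB', 'PersonB', 'PersonB'],
    'Date': pd.to_datetime(['2014-03-05', '2015-03-01', '2015-03-01']),
    'Salary': [65000, 70000, 75000],  # two records effective 2015-03-01 (made-up data)
})

# Keep only the last record for each (Employee, Date) pair
deduped = df.drop_duplicates(subset=['Employee', 'Date'], keep='last')

# Dates are now unique within each employee, so they can serve as the index
deduped = deduped.set_index('Date')
print(deduped)
```

Different employees may still share a date, which is fine as long as the resampling is done per employee through groupby.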
