2017-03-01 54 views
2

Cumulative sum on only 1 column in Python

I only want to apply cumsum to one specific column, because the other columns must keep their existing values.

This is what I have so far:

df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum() 

However, this script applies the cumulative sum to every column of my pandas DataFrame. The only column that should be accumulated is data.

As requested, here is some sample data:

import pandas as pd

df = pd.DataFrame({'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787", 
          "880022443344556677782", "880022443344556677787", "880022443344556677782", 
          "880022443344556677781"], 
        'Month': ["201701", "201701", "201702", "201702", "201703", "201703", "201703"], 
        'Usage': [20, 40, 100, 50, 30, 30, 2000], 
        'Sec': [10, 15, 20, 1, 5, 6, 30]}) 

         ID Month Sec Usage 
0 880022443344556677787 201701 10  20 
1 880022443344556677782 201701 15  40 
2 880022443344556677787 201702 20 100 
3 880022443344556677782 201702 1  50 
4 880022443344556677787 201703 5  30 
5 880022443344556677782 201703 6  30 
6 880022443344556677781 201703 30 2000 

Desired output:

     ID Month Sec Usage 
0 880022443344556677787 201701 10  20 
1 880022443344556677782 201701 15  40 
2 880022443344556677787 201702 20 120 
3 880022443344556677782 201702 1  90 
4 880022443344556677787 201703 5 150 
5 880022443344556677782 201703 6 120 
6 880022443344556677781 201703 30 2000 

Answers

2

I think you need set_index on the columns that should not be cumulated. I find them dynamically with a list comprehension:

cumsum_col = 'Usage' 
df1 = df.groupby(by=['ID','Month'], sort=False).sum() 
cols = [col for col in df1.columns if col != cumsum_col] 

df1 = df1.set_index(cols, append=True).groupby(level=[0]).cumsum().reset_index() 
print (df1) 
         ID Month Sec Usage 
0 880022443344556677787 201701 10  20 
1 880022443344556677782 201701 15  40 
2 880022443344556677787 201702 20 120 
3 880022443344556677782 201702 1  90 
4 880022443344556677787 201703 5 150 
5 880022443344556677782 201703 6 120 
6 880022443344556677781 201703 30 2000 
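
For the sample data in the question, where each (ID, Month) pair occurs only once, the same result can probably be reached without set_index by selecting the column after grouping. This is an alternative sketch, not the method shown above; it assumes the rows are already in chronological order within each ID, and the name `out` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787",
           "880022443344556677782", "880022443344556677787", "880022443344556677782",
           "880022443344556677781"],
    'Month': ["201701", "201701", "201702", "201702", "201703", "201703", "201703"],
    'Usage': [20, 40, 100, 50, 30, 30, 2000],
    'Sec': [10, 15, 20, 1, 5, 6, 30],
})

# Work on a copy so the original frame stays untouched.
out = df.copy()

# Running total of Usage within each ID; every other column is left as-is.
out['Usage'] = df.groupby('ID', sort=False)['Usage'].cumsum()
print(out)
```

Because only the `Usage` column is selected after the groupby, `Sec` and `Month` are never passed through cumsum.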

EDIT:

cumsum_col = 'Usage' 
df2 = df.groupby(by=['ID','Month'], sort=False).sum() 
cols = [col for col in df2.columns if col != cumsum_col] 
df1 = df2.set_index(cols, append=True).groupby(level=[0]).cumsum() 
df1 = df2.assign(Usage_cumsum = df1.reset_index(level=2, drop=True)).reset_index() 
print (df1) 
         ID Month Sec Usage Usage_cumsum 
0 880022443344556677787 201701 10  20   20 
1 880022443344556677782 201701 15  40   40 
2 880022443344556677787 201702 20 100   120 
3 880022443344556677782 201702 1  50   90 
4 880022443344556677787 201703 5  30   150 
5 880022443344556677782 201703 6  30   120 
6 880022443344556677781 201703 30 2000   2000 

EDIT1:

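In your sample data there is nothing for sum to aggregate, so the data is modified a bit (the solution is similar to the other one, but not the same):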

df = pd.DataFrame({'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787", 
          "880022443344556677782", "880022443344556677787", "880022443344556677782", 
          "880022443344556677781"], 
        'Month': ["201701", "201701", "201701", "201702", "201703", "201701", "201703"], 
        'Usage': [20, 40, 100, 50, 30, 30, 2000], 
        'Sec': [10, 15, 20, 1, 5, 6, 30]}) 

print (df) 
         ID Month Sec Usage 
0 880022443344556677787 201701 10  20 
1 880022443344556677782 201701 15  40 
2 880022443344556677787 201701 20 100 
3 880022443344556677782 201702 1  50 
4 880022443344556677787 201703 5  30 
5 880022443344556677782 201701 6  30 
6 880022443344556677781 201703 30 2000 
#aggregate sum to all columns 
df1 = df.groupby(['ID', 'Month']).sum() 
print (df1) 
           Sec Usage 
ID     Month    
880022443344556677781 201703 30 2000 
880022443344556677782 201701 21  70 
         201702 1  50 
880022443344556677787 201701 30 120 
         201703 5  30 

#aggregate cumsum to Usage column only 
s = df1.groupby(level=0)['Usage'].cumsum() 
print (s) 
ID      Month 
880022443344556677781 201703 2000 
880022443344556677782 201701  70 
         201702  120 
880022443344556677787 201701  120 
         201703  150 
Name: Usage, dtype: int64 
#join cumsum series to aggregate df1 
df3 = df1.join(s, rsuffix='_cumsum').reset_index() 
print (df3) 
         ID Month Sec Usage Usage_cumsum 
0 880022443344556677781 201703 30 2000   2000 
1 880022443344556677782 201701 21  70   70 
2 880022443344556677782 201702 1  50   120 
3 880022443344556677787 201701 30 120   120 
4 880022443344556677787 201703 5  30   150 
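
The three steps above (sum per group, cumsum on one column, join back) can also be sketched with a flat index instead of a MultiIndex. This is a compact variant under the same modified sample data, not a replacement for the code above; `as_index=False` keeps ID and Month as ordinary columns.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787",
           "880022443344556677782", "880022443344556677787", "880022443344556677782",
           "880022443344556677781"],
    'Month': ["201701", "201701", "201701", "201702", "201703", "201701", "201703"],
    'Usage': [20, 40, 100, 50, 30, 30, 2000],
    'Sec': [10, 15, 20, 1, 5, 6, 30],
})

# Step 1: aggregate sum over all value columns, one row per (ID, Month).
df1 = df.groupby(['ID', 'Month'], as_index=False)[['Sec', 'Usage']].sum()

# Step 2: cumulative sum of Usage only, restarting for each ID.
df1['Usage_cumsum'] = df1.groupby('ID')['Usage'].cumsum()
print(df1)
```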
+0

Is it possible to add the cumulative sum as an additional column instead of replacing the data? –

+0

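I'm not sure what is happening, but when I apply it to my df, your first approach works, while the new approach with the additional cumsum column returns 'NaN' values. Do you know what is going on? –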

+1

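So it looks like your real data has more columns, so you need to change it to 'df1.reset_index(level=[2,3,4], drop=True)', with one level for each extra column. But I have added another solution to the answer. – jezrael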
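
The NaN values discussed in these comments most likely come from index alignment: every extra column moved into the index adds a level, and all of those levels have to be dropped before the result lines up with the (ID, Month) index again. A minimal sketch that drops the levels by name rather than by position follows; the `Extra` column is hypothetical and only stands in for whatever additional columns the real data has.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787",
           "880022443344556677782", "880022443344556677787", "880022443344556677782",
           "880022443344556677781"],
    'Month': ["201701", "201701", "201702", "201702", "201703", "201703", "201703"],
    'Usage': [20, 40, 100, 50, 30, 30, 2000],
    'Sec': [10, 15, 20, 1, 5, 6, 30],
    'Extra': [1, 2, 3, 4, 5, 6, 7],  # hypothetical extra column
})

cumsum_col = 'Usage'
df2 = df.groupby(by=['ID', 'Month'], sort=False).sum()
cols = [col for col in df2.columns if col != cumsum_col]

# Move the untouched columns into the index, then cumulate Usage per ID.
cum = df2.set_index(cols, append=True).groupby(level=0)[cumsum_col].cumsum()

# Drop every appended level by name so the index is (ID, Month) again,
# which lets assign() align the values instead of producing NaN.
result = df2.assign(Usage_cumsum=cum.droplevel(cols)).reset_index()
print(result)
```

Dropping by name means the code does not need to change when the number of extra columns changes.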

3

Consider the DataFrame df

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
     name=list('aaaaaaaabbbbbbbb'), 
     day=np.tile(np.arange(2).repeat(4), 2), 
     data=np.arange(16) 
    )) 

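First, perform cumsum on one specific column by naming that column after the groupby statement.

Second, you can add the result back to the DataFrame df with join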

d2 = df.groupby(['name', 'day']).data.sum().groupby(level=0).cumsum() 

df.join(d2, on=['name', 'day'], rsuffix='_cum') 

    data day name data_cum 
0  0 0 a   6 
1  1 0 a   6 
2  2 0 a   6 
3  3 0 a   6 
4  4 1 a  28 
5  5 1 a  28 
6  6 1 a  28 
7  7 1 a  28 
8  8 0 b  38 
9  9 0 b  38 
10 10 0 b  38 
11 11 0 b  38 
12 12 1 b  92 
13 13 1 b  92 
14 14 1 b  92 
15 15 1 b  92 
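
If one row per (name, day) is preferred over repeating the value on every original row, the intermediate series can be turned into a small summary table instead of being joined back. This is only a sketch; the name `summary` is illustrative, and `data_cum` simply mirrors the suffix used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    name=list('aaaaaaaabbbbbbbb'),
    day=np.tile(np.arange(2).repeat(4), 2),
    data=np.arange(16),
))

# Sum within each (name, day), then cumulate those sums within each name.
d2 = df.groupby(['name', 'day'])['data'].sum().groupby(level=0).cumsum()

# One row per (name, day) with the cumulative total as a regular column.
summary = d2.rename('data_cum').reset_index()
print(summary)
```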
1

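You can already do the cumulative sum ('cumsum') as an aggregation on df.groupby. You need to pass it as the string 'cumsum' as the aggregation function for the 'data' column.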

df.groupby(['name','day']).agg({'data': 'cumsum'}) 
+1

This is wrong, because you first need to aggregate with 'sum', and then group by the first level only to aggregate with cumsum. – jezrael
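
The difference raised in this comment can be illustrated with the name/day/data frame from the previous answer. The sketch below calls cumsum directly on the grouped column rather than through agg, on the assumption that both produce the same per-row running totals; `per_row` and `per_group` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    name=list('aaaaaaaabbbbbbbb'),
    day=np.tile(np.arange(2).repeat(4), 2),
    data=np.arange(16),
))

# cumsum inside each (name, day) group: a running total per original row.
per_row = df.groupby(['name', 'day'])['data'].cumsum()

# sum per (name, day) first, then cumsum across days within each name.
per_group = df.groupby(['name', 'day'])['data'].sum().groupby(level=0).cumsum()

print(per_row.tolist())
print(per_group.tolist())
```

The first result has one value per input row and restarts for every (name, day) pair, while the second has one value per (name, day) and only restarts when the name changes.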