2017-07-20 80 views

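Pandas: an elegant way to implement groupby + aggregation for a multi-column DataFrame?

I have a DataFrame my_df, and I want to create a new DataFrame new_df. Each column of new_df is produced by grouping my_df by my_id and taking the max of the corresponding column in my_df.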

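Below is my code, which works fine. However, I wonder whether there is a better way, especially since in the future I will be dealing with hundreds of columns rather than just 6. Many thanks!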

tmp_df1 = my_df.groupby(['my_id'], as_index=False).col_A.agg({"max_A": "max"}) 
tmp_df2 = my_df.groupby(['my_id'], as_index=False).col_B.agg({"max_B": "max"}) 
tmp_df3 = my_df.groupby(['my_id'], as_index=False).col_C.agg({"max_C": "max"}) 
tmp_df4 = my_df.groupby(['my_id'], as_index=False).col_D.agg({"max_D": "max"}) 
tmp_df5 = my_df.groupby(['my_id'], as_index=False).col_E.agg({"max_E": "max"}) 
tmp_df6 = my_df.groupby(['my_id'], as_index=False).col_F.agg({"max_F": "max"}) 

combine_df1 = pd.merge(tmp_df1,tmp_df2,how="inner",on=['my_id']) 
combine_df2 = pd.merge(combine_df1,tmp_df3,how="inner",on=['my_id']) 
combine_df3 = pd.merge(combine_df2,tmp_df4,how="inner",on=['my_id']) 
combine_df4 = pd.merge(combine_df3,tmp_df5,how="inner",on=['my_id']) 
new_df = pd.merge(combine_df4,tmp_df6,how="inner",on=['my_id']) 

1 Answer

Consider the example DataFrame my_df:

import numpy as np
import pandas as pd

# Fix the seed so the random sample data is reproducible
np.random.seed([3,1415])

my_df = pd.DataFrame(dict(
    my_id=list('WXYZ') * 4, 
    col_A=np.random.randint(10, size=16), 
    col_B=np.random.randint(10, size=16), 
    col_C=np.random.randint(10, size=16), 
    col_D=np.random.randint(10, size=16), 
    col_E=np.random.randint(10, size=16), 
    col_F=np.random.randint(10, size=16), 
)) 

my_df 

    col_A col_B col_C col_D col_E col_F my_id 
0  0  3  6  1  9  3  W 
1  2  2  4  5  8  5  X 
2  7  4  7  2  0  8  Y 
3  3  3  6  8  4  5  Z 
4  8  3  2  2  3  1  W 
5  7  6  6  4  0  5  X 
6  0  7  6  7  4  1  Y 
7  6  7  5  6  1  4  Z 
8  8  4  2  9  5  3  W 
9  6  5  8  4  8  9  X 
10  0  3  7  2  6  5  Y 
11  2  7  5  4  0  5  Z 
12  0  5  8  6  8  7  W 
13  4  9  4  3  7  0  X 
14  9  8  7  8  4  3  Y 
15  7  7  6  3  6  2  Z 

You just want to use groupby followed by max:

new_df = my_df.groupby('my_id', as_index=False).max() 
new_df 

    my_id col_A col_B col_C col_D col_E col_F 
0  W  8  5  8  9  9  7 
1  X  7  9  8  5  8  9 
2  Y  9  8  7  8  6  8 
3  Z  7  7  6  8  6  5 
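
With hundreds of columns, you may not want every column aggregated. A minimal sketch, assuming the value columns share the `col_` prefix as in the question (the tiny frame and the `note` column below are hypothetical, made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame; 'note' stands in for a column you do not want aggregated.
my_df = pd.DataFrame({
    "my_id": ["W", "X", "W", "X"],
    "col_A": [0, 2, 8, 7],
    "col_B": [3, 2, 3, 6],
    "note": ["a", "b", "c", "d"],
})

# Pick the value columns by prefix instead of listing them by hand.
value_cols = [c for c in my_df.columns if c.startswith("col_")]

# Group once, restrict to the chosen columns, and take the max per group.
subset_max = my_df.groupby("my_id", as_index=False)[value_cols].max()
print(subset_max)
```

Selecting the columns after the groupby keeps it a single pass over the data, however many columns match the prefix.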

If you want to change the column names:

my_df.groupby('my_id').max() \ 
    .rename(columns=lambda x: x.replace('col_', '')) \ 
    .add_suffix('_max').reset_index() 

    my_id A_max B_max C_max D_max E_max F_max 
0  W  8  5  8  9  9  7 
1  X  7  9  8  5  8  9 
2  Y  9  8  7  8  6  8 
3  Z  7  7  6  8  6  5 

Or:

my_df.groupby('my_id').max() \ 
    .rename(columns=lambda x: x.rsplit('_', 1)[1] + '_max').reset_index() 

    my_id A_max B_max C_max D_max E_max F_max 
0  W  8  5  8  9  9  7 
1  X  7  9  8  5  8  9 
2  Y  9  8  7  8  6  8 
3  Z  7  7  6  8  6  5 
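
Another option, not part of the original answer, is named aggregation, which does the aggregating and the renaming in one call. This is only a sketch: it assumes pandas 0.25 or later, and the small frame and the names `value_cols` and `named_aggs` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the same column naming as the question.
my_df = pd.DataFrame({
    "my_id": ["W", "X", "W", "X"],
    "col_A": [0, 2, 8, 7],
    "col_B": [3, 2, 3, 6],
})

value_cols = [c for c in my_df.columns if c != "my_id"]

# Build {new_name: (source_column, aggregation)} pairs,
# e.g. "A_max": ("col_A", "max").
named_aggs = {c.replace("col_", "") + "_max": (c, "max") for c in value_cols}

# Named aggregation: each keyword becomes an output column.
renamed_max = my_df.groupby("my_id", as_index=False).agg(**named_aggs)
print(renamed_max)
```

The same dictionary can mix aggregations, for example ("col_A", "min") next to ("col_B", "max"), if different columns later need different functions.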

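Is it possible to give the new_df columns new names as part of this process, i.e. A_max instead of col_A, B_max instead of col_B, and so on? I am trying to avoid manually renaming every column afterwards. Thanks! – Edamame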


@Edamame I have updated my post. – piRSquared