2017-02-18
0

Using pandas groupby to extract data and transpose

My data, simplified, looks like this:

mon site year data1 data2 
1 57598 2001 58 1383 
2 57598 2001 75 549 
1 57598 2002 118 1337 
2 57598 2002 162 2213 

1 50136 2000 -282 134 
2 50136 2000 -242 0 
1 50136 2001 -126 102 

1 50844 2000 152 411 
2 50844 2000 70 117 
1 50844 2002 -74 44 
2 50844 2002 -173 83 

I want to extract data1 and data2 and reshape each into the following form. This is data1:

        2000   2000   2001   2001   2002   2002
           1      2      1      2      1      2
50136   -282   -242   -126     NA     NA     NA
50844    152     70     NA     NA    -74   -173
57598     NA     NA     58     75    118    162

data2 should be saved to a new file in the same form as data1. I want to do this with pandas.groupby, but the following code raises an error:

df['data1'].groupby(df['year'],df['mon'],df['site']) 

Is there an easy way to do this with groupby?

Answers

2

I think the best first attempt is set_index followed by unstack:

df1 = df.set_index(['year','mon','site'])['data1'].unstack(level=[0,1]).sort_index(axis=1)
print(df1)
year     2000            2001          2002
mon         1       2       1     2       1       2
site
50136  -282.0  -242.0  -126.0   NaN     NaN     NaN
50844   152.0    70.0     NaN   NaN   -74.0  -173.0
57598     NaN     NaN    58.0  75.0   118.0   162.0
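For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the same set_index and unstack approach. It assumes the question's rows have been loaded into a DataFrame named `df`. The `rows` list is only an illustrative way to build that frame and is not part of the original post.

```python
import pandas as pd

# The sample rows from the question: (mon, site, year, data1, data2).
rows = [
    (1, 57598, 2001, 58, 1383),
    (2, 57598, 2001, 75, 549),
    (1, 57598, 2002, 118, 1337),
    (2, 57598, 2002, 162, 2213),
    (1, 50136, 2000, -282, 134),
    (2, 50136, 2000, -242, 0),
    (1, 50136, 2001, -126, 102),
    (1, 50844, 2000, 152, 411),
    (2, 50844, 2000, 70, 117),
    (1, 50844, 2002, -74, 44),
    (2, 50844, 2002, -173, 83),
]
df = pd.DataFrame(rows, columns=["mon", "site", "year", "data1", "data2"])

# Move the three key columns into the index, keep only data1, then pivot
# index levels 0 and 1 ('year' and 'mon') into the columns.
df1 = (
    df.set_index(["year", "mon", "site"])["data1"]
    .unstack(level=[0, 1])
    .sort_index(axis=1)
)
print(df1)
```

The values become floats because the missing (site, year, mon) combinations are filled with NaN.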

But if you get:

ValueError: Index contains duplicate entries, cannot reshape

use another solution, with groupby or pivot_table.
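To see where this error comes from, the sketch below uses a small hypothetical frame, `dup`, in which one (year, mon, site) key occurs twice. The frame and its values are made up for illustration and are not from the question's data. unstack cannot decide which of the two values belongs in the cell, while aggregating first leaves one value per key.

```python
import pandas as pd

# Hypothetical data: the key (2000, 1, 50136) appears twice.
dup = pd.DataFrame(
    {
        "mon": [1, 1, 2],
        "site": [50136, 50136, 50136],
        "year": [2000, 2000, 2000],
        "data1": [-282, -280, -242],
    }
)

# Quick check for repeated keys before reshaping.
has_dups = dup.duplicated(subset=["year", "mon", "site"]).any()

try:
    dup.set_index(["year", "mon", "site"])["data1"].unstack(level=[0, 1])
    reshape_failed = False
except ValueError as exc:
    reshape_failed = True
    print("unstack failed:", exc)

# Aggregating first collapses the duplicates into one value per key.
agg = dup.groupby(["year", "mon", "site"])["data1"].mean().unstack(level=[0, 1])
print(agg)
```

Whether the mean is the right way to combine duplicates depends on the data. A sum, first, or max may fit better.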

You can use groupby with unstack:

df1 = df.groupby(['year','mon','site'])['data1'].mean().unstack(level=[0,1])
print(df1)
year     2000            2001          2002
mon         1       2       1     2       1       2
site
50136  -282.0  -242.0  -126.0   NaN     NaN     NaN
50844   152.0    70.0     NaN   NaN   -74.0  -173.0
57598     NaN     NaN    58.0  75.0   118.0   162.0
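As for the call in the question, the likely cause of the error is that the three keys were passed as separate positional arguments, so only the first is treated as a grouping key and the rest are read as other parameters of groupby. Putting the keys in a single list should fix it. The sketch below assumes the question's rows are in `df`. The names `by_series` and `by_name` are illustrative.

```python
import pandas as pd

# The sample rows from the question: (mon, site, year, data1, data2).
rows = [
    (1, 57598, 2001, 58, 1383),
    (2, 57598, 2001, 75, 549),
    (1, 57598, 2002, 118, 1337),
    (2, 57598, 2002, 162, 2213),
    (1, 50136, 2000, -282, 134),
    (2, 50136, 2000, -242, 0),
    (1, 50136, 2001, -126, 102),
    (1, 50844, 2000, 152, 411),
    (2, 50844, 2000, 70, 117),
    (1, 50844, 2002, -74, 44),
    (2, 50844, 2002, -173, 83),
]
df = pd.DataFrame(rows, columns=["mon", "site", "year", "data1", "data2"])

# The question's call, with the key Series wrapped in one list.
by_series = (
    df["data1"]
    .groupby([df["year"], df["mon"], df["site"]])
    .mean()
    .unstack(level=[0, 1])
)

# The spelling used in this answer: group the DataFrame by column names.
by_name = df.groupby(["year", "mon", "site"])["data1"].mean().unstack(level=[0, 1])

print(by_series)
```

Both spellings should give the same table. Grouping by column names is usually shorter to write.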

Another possible solution is pivot_table. Its default aggfunc computes the mean, but it can be changed to another function, such as aggfunc='sum':

print(df.pivot_table(index='site', columns=['year','mon'], values='data1', aggfunc='mean'))
year     2000            2001          2002
mon         1       2       1     2       1       2
site
50136  -282.0  -242.0  -126.0   NaN     NaN     NaN
50844   152.0    70.0     NaN   NaN   -74.0  -173.0
57598     NaN     NaN    58.0  75.0   118.0   162.0
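The sketch below shows pivot_table with a different aggfunc, and also with both value columns passed at once, which may save a second call when data1 and data2 need the same reshaping. It assumes the question's rows are in `df`. The names `sum_tbl` and `both` are illustrative.

```python
import pandas as pd

# The sample rows from the question: (mon, site, year, data1, data2).
rows = [
    (1, 57598, 2001, 58, 1383),
    (2, 57598, 2001, 75, 549),
    (1, 57598, 2002, 118, 1337),
    (2, 57598, 2002, 162, 2213),
    (1, 50136, 2000, -282, 134),
    (2, 50136, 2000, -242, 0),
    (1, 50136, 2001, -126, 102),
    (1, 50844, 2000, 152, 411),
    (2, 50844, 2000, 70, 117),
    (1, 50844, 2002, -74, 44),
    (2, 50844, 2002, -173, 83),
]
df = pd.DataFrame(rows, columns=["mon", "site", "year", "data1", "data2"])

# Same reshape as above, but summing instead of averaging.
sum_tbl = df.pivot_table(
    index="site", columns=["year", "mon"], values="data1", aggfunc="sum"
)

# Passing a list of value columns adds an outer column level,
# so both["data1"] and both["data2"] are the two reshaped tables.
both = df.pivot_table(
    index="site", columns=["year", "mon"], values=["data1", "data2"], aggfunc="mean"
)
print(both["data2"])
```

With this sample every (site, year, mon) key is unique, so sum and mean return the same numbers. They differ only when keys repeat.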

Finally, to write the result to a CSV file, use DataFrame.to_csv:

df1.to_csv('file_out.csv') 
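By default to_csv writes missing values as empty fields. If the file should show NA as in the question's desired layout, the na_rep parameter can be used. The sketch below works on a trimmed subset of the question's rows, chosen only to keep the example short, and returns the CSV as text instead of writing a file.

```python
import pandas as pd

# A trimmed subset of the question's rows, enough to leave gaps in the table.
df = pd.DataFrame(
    [
        (1, 50136, 2000, -282, 134),
        (2, 50136, 2000, -242, 0),
        (1, 50136, 2001, -126, 102),
        (1, 57598, 2001, 58, 1383),
        (2, 57598, 2001, 75, 549),
    ],
    columns=["mon", "site", "year", "data1", "data2"],
)
df1 = df.groupby(["year", "mon", "site"])["data1"].mean().unstack(level=[0, 1])

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df1.to_csv(na_rep="NA")
print(csv_text)
```

Passing a file name as the first argument, as in the line above, writes the same text to disk.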
0

To get the DataFrame into the shape you need:

result = df.groupby(['site','mon','year'])['data1'].mean().unstack().unstack()
print(result)
year     2000            2001          2002
mon         1       2       1     2       1       2
site
50136  -282.0  -242.0  -126.0   NaN     NaN     NaN
50844   152.0    70.0     NaN   NaN   -74.0  -173.0
57598     NaN     NaN    58.0  75.0   118.0   162.0

To save it to CSV:

df.groupby(['site','mon','year'])['data1'].mean().unstack().unstack().to_csv('data1.csv') 
df.groupby(['site','mon','year'])['data2'].mean().unstack().unstack().to_csv('data2.csv')
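Since the two lines differ only in the column name, the reshape could be wrapped in a small helper and applied in a loop. This is a sketch, not part of the original answer. The helper name `reshape` is hypothetical, and the files go to a temporary directory here only so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# The sample rows from the question: (mon, site, year, data1, data2).
rows = [
    (1, 57598, 2001, 58, 1383),
    (2, 57598, 2001, 75, 549),
    (1, 57598, 2002, 118, 1337),
    (2, 57598, 2002, 162, 2213),
    (1, 50136, 2000, -282, 134),
    (2, 50136, 2000, -242, 0),
    (1, 50136, 2001, -126, 102),
    (1, 50844, 2000, 152, 411),
    (2, 50844, 2000, 70, 117),
    (1, 50844, 2002, -74, 44),
    (2, 50844, 2002, -173, 83),
]
df = pd.DataFrame(rows, columns=["mon", "site", "year", "data1", "data2"])


def reshape(frame, value_col):
    """Hypothetical helper: sites as rows, (year, mon) pairs as columns."""
    grouped = frame.groupby(["site", "mon", "year"])[value_col].mean()
    # The first unstack moves 'year' to the columns, the second moves 'mon'.
    return grouped.unstack().unstack()


out_dir = Path(tempfile.mkdtemp())  # assumption: any writable folder works
written = []
for col in ["data1", "data2"]:
    path = out_dir / f"{col}.csv"
    reshape(df, col).to_csv(path)
    written.append(path)

print(written)
```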