2017-04-18 48 views
1

Fill in missing rows after a groupby

I have some SQL data that I am grouping and running some aggregations on. It works well:

import numpy

# df is the DataFrame loaded from SQL; it has columns a, b, c and d
grouped = df.groupby(['a', 'b'])
agged = grouped.aggregate({
    'c': [numpy.sum, numpy.mean, numpy.size],
    'd': [numpy.sum, numpy.mean, numpy.size]
})

              c                           d
            sum      mean   size      sum          mean size
a  b
25 20     107.0  0.804511  133.0  5328000  40060.150376  133
   21     110.0  0.774648  142.0  6031000  42471.830986  142
   23     126.0  0.792453  159.0  8795000  55314.465409  159
   24      72.0  0.947368   76.0  2920000  38421.052632   76
   25      54.0  0.818182   66.0  2570000  38939.393939   66
26 23     126.0  0.792453  159.0  8795000  55314.465409  159

But I want to fill in, with zeros, every row that exists for a=25 but not for a=26. In other words, something like this:

              c                           d
            sum      mean   size      sum          mean size
a  b
25 20     107.0  0.804511  133.0  5328000  40060.150376  133
   21     110.0  0.774648  142.0  6031000  42471.830986  142
   23     126.0  0.792453  159.0  8795000  55314.465409  159
   24      72.0  0.947368   76.0  2920000  38421.052632   76
   25      54.0  0.818182   66.0  2570000  38939.393939   66
26 20         0         0      0        0             0    0
   21         0         0      0        0             0    0
   23     126.0  0.792453  159.0  8795000  55314.465409  159
   24         0         0      0        0             0    0
   25         0         0      0        0             0    0

How can I do this?

+1

Your output doesn't match what you asked for. 'a == 25' would be the entire first block. Why do you want to zero out rows in the 'a == 26' group? – piRSquared

+0

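I may not have explained it very well. I basically want to fill in any missing "rows" with 0 after the grouping is done, so that the data is more "complete" when it is used elsewhere. –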

+0

Possible duplicate of [Pandas category sub-group 0 counts](http://stackoverflow.com/questions/43097140/pandas-category-sub-group-0-counts) – gereleth

Answers

2

Consider the dataframe df:

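import numpy as np
import pandas as pd

# Random integers, indexed by (a, b), with two-level columns like the aggregated frame
df = pd.DataFrame(
    np.random.randint(10, size=(6, 6)),
    pd.MultiIndex.from_tuples(
        [(25, 20), (25, 21), (25, 23), (25, 24), (25, 25), (26, 23)],
        names=['a', 'b']
    ),
    pd.MultiIndex.from_product(
        [['c', 'd'], ['sum', 'mean', 'size']]
    )
)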

        c             d
      sum mean size sum mean size
a  b
25 20   8    3    5   5    0    2
   21   3    7    8   9    2    7
   23   2    1    3   2    5    4
   24   9    0    1   7    1    6
   25   1    9    3   5    8    8
26 23   8    8    4   8    0    5

You can quickly recover all the missing rows of the Cartesian product of a and b with unstack(fill_value=0) followed by stack:

df.unstack(fill_value=0).stack() 

         c              d
      mean size sum mean size sum
a  b
25 20    3    5   8    0    2   5
   21    7    8   3    2    7   9
   23    1    3   2    5    4   2
   24    0    1   9    1    6   7
   25    9    3   1    8    8   5
26 20    0    0   0    0    0   0
   21    0    0   0    0    0   0
   23    8    4   8    0    5   8
   24    0    0   0    0    0   0
   25    0    0   0    0    0   0

Note: using fill_value=0 preserves the int dtype. Without it, unstacking fills the gaps with NaN and the dtypes are converted to float.
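The dtype point can be checked on a tiny frame. This is only a sketch: the frame small, its column n, and the index values are made up for illustration and do not come from the question's data.

```python
import pandas as pd

# Hypothetical integer frame; the (2, 'y') combination is missing on purpose.
idx = pd.MultiIndex.from_tuples(
    [(1, 'x'), (1, 'y'), (2, 'x')], names=['a', 'b']
)
small = pd.DataFrame({'n': [10, 20, 30]}, index=idx)

# Filling while unstacking: the gap never becomes NaN, so the column stays integer.
filled = small.unstack(fill_value=0).stack()

# Unstacking first creates NaN, which forces float; filling afterwards does not undo that.
unfilled = small.unstack().fillna(0).stack()

print(filled.dtypes)
print(unfilled.dtypes)
```

Both results contain the same four rows; only the dtype of n should differ.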

1

print(df)

              c                           d
            sum      mean   size      sum          mean size
a  b
25 20     107.0  0.804511  133.0  5328000  40060.150376  133
   21     110.0  0.774648  142.0  6031000  42471.830986  142
   23     126.0  0.792453  159.0  8795000  55314.465409  159
   24      72.0  0.947368   76.0  2920000  38421.052632   76
   25      54.0  0.818182   66.0  2570000  38939.393939   66
26 23     126.0  0.792453  159.0  8795000  55314.465409  159

I would use:

df = df.unstack().replace(np.nan,0).stack(-1) 
print(df) 
              c                          d
           mean   size    sum          mean   size        sum
a  b
25 20  0.804511  133.0  107.0  40060.150376  133.0  5328000.0
   21  0.774648  142.0  110.0  42471.830986  142.0  6031000.0
   23  0.792453  159.0  126.0  55314.465409  159.0  8795000.0
   24  0.947368   76.0   72.0  38421.052632   76.0  2920000.0
   25  0.818182   66.0   54.0  38939.393939   66.0  2570000.0
26 20  0.000000    0.0    0.0      0.000000    0.0        0.0
   21  0.000000    0.0    0.0      0.000000    0.0        0.0
   23  0.792453  159.0  126.0  55314.465409  159.0  8795000.0
   24  0.000000    0.0    0.0      0.000000    0.0        0.0
   25  0.000000    0.0    0.0      0.000000    0.0        0.0
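To tie this back to the question, here is a minimal end-to-end sketch that runs the groupby first and then completes the index. The frame raw and its values are hypothetical stand-ins for the SQL data, and the aggregations are given as strings rather than numpy functions, which is an assumption about what the asker would accept.

```python
import pandas as pd

# Hypothetical raw rows standing in for the SQL data; there is no (2, 'y') row.
raw = pd.DataFrame({
    'a': [1, 1, 1, 2],
    'b': ['x', 'x', 'y', 'x'],
    'c': [1.0, 3.0, 5.0, 7.0],
    'd': [10, 20, 30, 40],
})

# Same shape of aggregation as in the question: sum, mean and size per column.
agged = raw.groupby(['a', 'b']).agg({
    'c': ['sum', 'mean', 'size'],
    'd': ['sum', 'mean', 'size'],
})

# Move b into the columns (filling gaps with 0), then move it back into the index.
completed = agged.unstack(fill_value=0).stack()
print(completed)
```

After this, completed should have one row for every (a, b) pair, with zeros in the rows that had no data.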