2016-02-12 87 views
7

Group by and find top n value_counts in pandas

I have a dataframe of taxi data that looks like this:

Neighborhood    Borough        Time
Midtown         Manhattan      X
Melrose         Bronx          Y
Grant City      Staten Island  Z
Midtown         Manhattan      A
Lincoln Square  Manhattan      B

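Basically, each row represents a taxi pickup in that neighborhood of that borough. Now I want to find the top 5 neighborhoods in each borough by number of pickups. I tried this: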

df['Neighborhood'].groupby(df['Borough']).value_counts() 

which gave me something like this:

borough       
Bronx   High Bridge   3424 
       Mott Haven   2515 
       Concourse Village  1443 
       Port Morris   1153 
       Melrose    492 
       North Riverdale  463 
       Eastchester   434 
       Concourse    395 
       Fordham    252 
       Wakefield    214 
       Kingsbridge   212 
       Mount Hope    200 
       Parkchester   191 
...... 

Staten Island Castleton Corners  4 
       Dongan Hills    4 
       Eltingville    4 
       Graniteville    4 
       Great Kills    4 
       Castleton    3 
       Woodrow     1 

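How do I filter it so that I get only the top 5 from each borough? I know there are a few questions with similar titles, but they did not help in my case.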

Answers

11

I think you can use nlargest; you can change 1 to 5:

s = df['Neighborhood'].groupby(df['Borough']).value_counts() 
print(s)
Borough      
Bronx   Melrose   7 
Manhattan  Midtown   12 
       Lincoln Square  2 
Staten Island Grant City  11 
dtype: int64 

print(s.groupby(level=0).nlargest(1))
Bronx   Bronx   Melrose  7 
Manhattan  Manhattan  Midtown  12 
Staten Island Staten Island Grant City 11 
dtype: int64 
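As a rough, self-contained sketch of the same idea: the toy data and the helper name top_n_per_borough below are made up for illustration and are not part of the original post. It relies on value_counts returning counts sorted in descending order within each borough, so head(n) keeps the n largest per group without adding an extra index level.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "Borough": ["Manhattan", "Manhattan", "Manhattan", "Manhattan",
                "Bronx", "Bronx", "Bronx", "Staten Island"],
    "Neighborhood": ["Midtown", "Midtown", "Midtown", "Lincoln Square",
                     "Melrose", "Melrose", "Mott Haven", "Grant City"],
})


def top_n_per_borough(frame, n=5):
    """Return the n most frequent neighborhoods within each borough."""
    # Counts come back sorted in descending order inside each borough.
    counts = frame.groupby("Borough")["Neighborhood"].value_counts()
    # head(n) keeps the first n rows of every borough group.
    return counts.groupby(level=0).head(n)


top = top_n_per_borough(df, n=1)
print(top)
```

Passing n=5 gives the top five per borough that the question asks for.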

An additional column is being created that repeats the level information.

+1

It is creating an extra level at level 0; just add s.index.droplevel(level=0) –

+0

@Nemish Kanwar - Thank you for the good idea. Or 'print(s.groupby(level=0).nlargest(1).reset_index(level=0, drop=True))' – jezrael
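To illustrate the point made in these comments, here is a small sketch. The series is rebuilt by hand from the counts printed in the answer, which is an assumption about what the real data looked like. With the default group keys, nlargest prepends the borough a second time, and reset_index(level=0, drop=True) removes that duplicate level.

```python
import pandas as pd

# Counts copied from the answer's printed output, rebuilt by hand.
s = pd.Series(
    [7, 12, 2, 11],
    index=pd.MultiIndex.from_tuples(
        [
            ("Bronx", "Melrose"),
            ("Manhattan", "Midtown"),
            ("Manhattan", "Lincoln Square"),
            ("Staten Island", "Grant City"),
        ],
        names=["Borough", "Neighborhood"],
    ),
)

# The group key is prepended, so the borough appears twice in the index.
top = s.groupby(level=0).nlargest(1)
print(top.index.nlevels)

# Drop the outermost (duplicated) level to get back to two levels.
flat = top.reset_index(level=0, drop=True)
print(flat)
```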

3

You can do this in a single line by slightly extending your original groupby with 'nlargest':

>>> df.groupby(['Borough', 'Neighborhood']).Neighborhood.value_counts().nlargest(5) 
Borough  Neighborhood Neighborhood 
Bronx   Melrose   Melrose   1 
Manhattan  Midtown   Midtown   1 
Manhatten  Lincoln Square Lincoln Square 1 
       Midtown   Midtown   1 
Staten Island Grant City  Grant City  1 
dtype: int64
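One thing to keep in mind is that calling nlargest(5) on the whole counted series ranks every (borough, neighborhood) pair together, so it returns the five largest overall rather than five per borough. A possible per-borough variant that yields a flat table is sketched below; the sample rows and the column name pickups are made up for illustration.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "Borough": ["Manhattan"] * 6 + ["Bronx"] * 3 + ["Staten Island"],
    "Neighborhood": ["Midtown", "Midtown", "Midtown",
                     "Lincoln Square", "Lincoln Square", "Chelsea",
                     "Melrose", "Melrose", "Mott Haven",
                     "Grant City"],
})

# One row per (borough, neighborhood) pair with its pickup count.
pickups = (
    df.groupby(["Borough", "Neighborhood"])
      .size()
      .reset_index(name="pickups")
)

# Sort by count within each borough, then keep the first n rows per borough.
n = 2
top = (
    pickups.sort_values(["Borough", "pickups"], ascending=[True, False])
           .groupby("Borough")
           .head(n)
           .reset_index(drop=True)
)
print(top)
```

Setting n = 5 matches the question; the result is an ordinary DataFrame, which can be easier to filter or plot than a MultiIndex series.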