pandas: groupby, find the maximum within each group, and return the value and its count

I have a pandas DataFrame with log data:

 host service 
0 this.com mail 
1 this.com mail 
2 this.com  web 
3 that.com mail 
4 other.net mail 
5 other.net  web 
6 other.net  web

For each host, I want to find the service that produced the most errors:

 host service no 
0 this.com mail 2 
1 that.com mail 1 
2 other.net  web 2

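The only solution I found was to group by host and service, and then iterate over level 0 of the index.

Can anyone suggest a better, shorter version? One without the iteration?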

df = df_logfile.groupby(['host','service']).agg({'service':np.size}) 

df_count = pd.DataFrame() 
df_count['host'] = df_logfile['host'].unique() 
df_count['service'] = np.nan 
df_count['no'] = np.nan 

for h,data in df.groupby(level=0): 
    i = data.idxmax()[0] 
    service = i[1]    
    no = data.xs(i)[0] 
    df_count.loc[df_count['host'] == h, 'service'] = service 
    df_count.loc[(df_count['host'] == h) & (df_count['service'] == service), 'no'] = no

Full code: https://gist.github.com/bjelline/d8066de66e305887b714

2014-11-02 bjelli

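Given df, the next step is to group by the host value alone and aggregate with idxmax. This gives you, for each host, the index label of the row with the largest count. You can then use df.loc[...] to select the rows of df that correspond to those largest counts: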

import numpy as np 
import pandas as pd 

df_logfile = pd.DataFrame({ 
    'host' : ['this.com', 'this.com', 'this.com', 'that.com', 'other.net', 
       'other.net', 'other.net'], 
    'service' : ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web' ] }) 

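# Count the rows for each (host, service) pair. Named aggregation is used here
# because the dict-renaming form agg({'no': 'count'}) was removed in pandas 1.0.
df = df_logfile.groupby(['host', 'service'])['service'].agg(no='count')
# For each host, find the (host, service) index label with the largest count.
mask = df.groupby(level=0).agg('idxmax')
# Select those rows and turn the index levels back into columns.
df_count = df.loc[mask['no']]
df_count = df_count.reset_index()
print("\nOutput\n{}".format(df_count))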

which yields the DataFrame

 host service no 
0 other.net  web 2 
1 that.com mail 1 
2 this.com mail 2
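Note that the hosts above come out sorted alphabetically, while the table in the question lists them in order of first appearance. A minimal sketch of a variant that keeps the original order, assuming a reasonably recent pandas, is to pass sort=False to groupby and to do the idxmax lookup on a flat count table. The names counts and top_rows are only illustrative:

```python
import pandas as pd

df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com',
             'other.net', 'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web'],
})

# Count rows per (host, service) pair. sort=False keeps the groups in order of
# first appearance; reset_index(name='no') yields a flat table with a 0..n-1 index.
counts = (df_logfile.groupby(['host', 'service'], sort=False)
                    .size()
                    .reset_index(name='no'))

# For each host, idxmax gives the row label in `counts` holding the largest count.
top_rows = counts.groupby('host', sort=False)['no'].idxmax()

# Pick those rows and renumber them from 0.
df_count = counts.loc[top_rows.tolist()].reset_index(drop=True)
print(df_count)
```

With the sample data this gives the rows in the order this.com, that.com, other.net. If two services tie for a host, idxmax keeps the first one it encounters, so the result depends on row order in that case.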

2014-11-02 17:19:01 unutbu

This idiom might make a nice addition to the groupby API: https://github.com/pydata/pandas/issues/8717 – Jeff 2014-11-02 21:48:34
