Python pandas groupby optimization

I have a large DataFrame with many rows and columns, and I need to group by the column 'group'. Here is a small example:

  group      rank             word
0     a  0.739631           entity
1     a  0.882556  physical_entity
2     b  0.588045      abstraction
3     b  0.640933            thing
4     c  0.726738           object
5     c  0.669280            whole
6     d  0.006574         congener
7     d  0.308684     living_thing
8     d  0.638631         organism
9     d  0.464244          benthos

Basically, I will apply a series of functions that create new columns and transform existing ones after grouping, for example:

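One of the functions I want to implement is top_word, which selects the highest-ranked word for each group. Its output would therefore be a unicode column: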

group  top_word
a      physical_entity [0.88]
b      thing [0.64]
c      object [0.73]
d      organism [0.64]

Currently, I am using this horrendous method:

import pandas as pd

def top_word(tab):
    # tab is already sorted by rank (descending), so the first row is the top word
    first = tab.iloc[0]
    res = '{} [{:.2f}]'.format(first['word'], first['rank'])
    return [res]

def aggr(x, fns):
    # apply every function in fns to the group's sub-table
    d = {key: fn(x) for key, fn in fns.items()}
    return pd.DataFrame(d)

fs = {'top_word': top_word}
# T is the DataFrame shown above
T = T.sort_values('rank', ascending=False)  # sort by rank, then I only have to pick the first result in the aggfunc!
T = T.groupby('group', sort=False).apply(lambda x: aggr(x, fs))
T.index = T.index.droplevel(level=1)

This gives (the results differ from the example above because of, e.g., the random number generator):

time taken: 0.0042 +- 0.0003 seconds
                top_word
group
a          entity [0.07]
b     abstraction [0.84]
c          object [0.92]
d        congener [0.06]

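I designed this method so that I can apply any function I want to the table at any time. It needs to stay this flexible, but it looks horrible! Is there a more efficient way to do this kind of thing? Would iterating over the groups and appending be better?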

Thanks

2014-10-16 Lucidnonsense

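Your current approach doesn't seem to give the result you're looking for. Is that intended, or do you want the first element in each group? – DSM 2014-10-16 15:57:06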

Sorry, I only just noticed. Editing! – Lucidnonsense 2014-10-16 15:57:34

There. I forgot the sort! – Lucidnonsense 2014-10-16 16:00:08

I think the idea is to groupby, then sort each group and use .agg() to keep the first observation:

In [192]: print(df)
  group      rank             word
0     a  0.739631           entity
1     a  0.882556  physical_entity
2     b  0.588045      abstraction
3     b  0.640933            thing
4     c  0.726738           object
5     c  0.669280            whole
6     d  0.006574         congener
7     d  0.308684     living_thing
8     d  0.638631         organism
9     d  0.464244          benthos

In [193]: print(df.groupby('group').agg(lambda x: sorted(x, reverse=True)[0]))
           rank             word
group
a      0.882556  physical_entity
b      0.640933            thing
c      0.726738            whole
d      0.638631         organism

In [194]: df_res = df.groupby('group').agg(lambda x: sorted(x, reverse=True)[0])
     ...: df_res.word + df_res['rank'].apply(lambda x: ' [%.2f]' % x)
Out[194]:
group
a    physical_entity [0.88]
b              thing [0.64]
c              whole [0.73]
d           organism [0.64]
dtype: object
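
Note that the lambda above sorts each column on its own, so the largest rank is paired with the alphabetically last word. That is why group c shows whole next to 0.726738, which is actually the rank of object. A minimal sketch that keeps each word with its own rank, assuming the same example data as above (the names best, top and top_word are only illustrative), uses idxmax to find the row label of the highest rank in each group:

```python
import pandas as pd

df = pd.DataFrame({
    'group': list('aabbccdddd'),
    'rank': [0.739631, 0.882556, 0.588045, 0.640933, 0.726738,
             0.669280, 0.006574, 0.308684, 0.638631, 0.464244],
    'word': ['entity', 'physical_entity', 'abstraction', 'thing', 'object',
             'whole', 'congener', 'living_thing', 'organism', 'benthos'],
})

# Row label of the highest rank within each group
best = df.groupby('group')['rank'].idxmax()

# Pull those whole rows, so each word stays paired with its own rank
top = df.loc[best].set_index('group')

# Build the formatted column, e.g. 'word [0.12]'
top_word = top['word'] + top['rank'].map(' [{:.2f}]'.format)
print(top_word)
```

This avoids sorting altogether; whether it is faster on a large table would have to be measured.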

2014-10-16 17:38:57

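Thanks. I know this kind of solution. But the problem is that I may need to apply aggregation functions that need access to multiple columns (table operations). And I will need to apply more than one of them, in batches, to different data! – Lucidnonsense 2014-10-16 18:20:26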
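
One way to keep that flexibility is sketched below, under the assumption that every function takes a group's sub-table and returns a single value. The aggregate helper and the spread function are hypothetical, added only for illustration. The sketch iterates over the groups explicitly and builds one row per group from a dict of functions:

```python
import pandas as pd

df = pd.DataFrame({
    'group': list('aabbccdddd'),
    'rank': [0.739631, 0.882556, 0.588045, 0.640933, 0.726738,
             0.669280, 0.006574, 0.308684, 0.638631, 0.464244],
    'word': ['entity', 'physical_entity', 'abstraction', 'thing', 'object',
             'whole', 'congener', 'living_thing', 'organism', 'benthos'],
})

def top_word(tab):
    # Needs two columns: locate the highest-ranked row, then format it
    best = tab.loc[tab['rank'].idxmax()]
    return '{} [{:.2f}]'.format(best['word'], best['rank'])

def spread(tab):
    # Hypothetical second function: range of ranks within the group
    return tab['rank'].max() - tab['rank'].min()

def aggregate(frame, by, fns):
    # Hypothetical helper: each function sees the whole sub-table of a group
    rows = {name: {key: fn(tab) for key, fn in fns.items()}
            for name, tab in frame.groupby(by)}
    out = pd.DataFrame.from_dict(rows, orient='index')
    out.index.name = by
    return out

result = aggregate(df, 'group', {'top_word': top_word, 'spread': spread})
print(result)
```

Because the functions live in a plain dict, a different batch can be applied to different data by passing another dict. Whether the explicit loop beats the apply version on a large table is something to time rather than assume.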
