2016-05-30 167 views
2

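Python: replacing DataFrame values row by row

I want to conditionally replace values row by row in a pandas DataFrame, so that max(row) is kept and all other values in the row are set to None. My instinct is to reach for apply(), but I'm not sure whether that is the right choice, or how to go about it.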

Example (though there may be more columns):

tmp= pd.DataFrame({ 
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)), 
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10)) 
}) 

tmp 
    A B 
0 1 3 
1 2 4 
2 3 1 
3 4 33 
4 5 10 
5 6 9 
6 7 7 
7 8 3 
8 9 10 
9 10 10 

Desired output:

somemagic(tmp) 
    A  B 
0 None 3 
1 None 4 
2 3  None 
3 None 33 
4 None 10 
5 None 9 
6 7  None # on tie I don't really care which one is set to None 
7 8  None 
8 None 10 
9 10  None # on tie I don't really care which one is set to None 

Any suggestions on how to achieve this?

Answers

2

You can compare the DataFrame with its row-wise max values using eq:

print (tmp[tmp.eq(tmp.max(axis=1), axis=0)]) 

mask = (tmp.eq(tmp.max(axis=1), axis=0)) 
print (mask) 
     A  B 
0 False True 
1 False True 
2 True False 
3 False True 
4 False True 
5 False True 
6 True True 
7 True False 
8 False True 
9 True True 

df = (tmp[mask]) 
print (df) 
     A  B 
0 NaN 3.0 
1 NaN 4.0 
2 3.0 NaN 
3 NaN 33.0 
4 NaN 10.0 
5 NaN 9.0 
6 7.0 7.0 
7 8.0 NaN 
8 NaN 10.0 
9 10.0 10.0 
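
The same mask works for any number of columns, because eq(..., axis=0) aligns the row-wise max on the index and compares it against every column. A minimal sketch, assuming a small made-up frame with a hypothetical third column C (not part of the original question):

```python
import pandas as pd

# Hypothetical three-column frame, for illustration only.
df3 = pd.DataFrame({"A": [1, 5, 2], "B": [3, 4, 2], "C": [2, 9, 1]})

# Row-wise maximum: one value per row, indexed like df3.
row_max = df3.max(axis=1)

# axis=0 aligns row_max with the row index, so every column
# is compared against the maximum of its own row.
mask = df3.eq(row_max, axis=0)

# Keep values where mask is True, NaN elsewhere (same effect as df3[mask]).
result = df3.where(mask)
print(result)
```

Note that ties are all kept here (row 2 keeps both A and B), and that the result becomes float because NaN cannot be stored in an integer column.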

Then, if you also want NaN for ties (rows where the values in the columns are equal), you can adjust the mask:

mask = (tmp.eq(tmp.max(axis=1), axis=0)) 
mask['B'] = mask.B & (tmp.A != tmp.B) 
print (mask) 
     A  B 
0 False True 
1 False True 
2 True False 
3 False True 
4 False True 
5 False True 
6 True False 
7 True False 
8 False True 
9 True False 

df = (tmp[mask]) 
print (df) 
     A  B 
0 NaN 3.0 
1 NaN 4.0 
2 3.0 NaN 
3 NaN 33.0 
4 NaN 10.0 
5 NaN 9.0 
6 7.0 NaN 
7 8.0 NaN 
8 NaN 10.0 
9 10.0 NaN 
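
The tie-breaking line above is written for exactly two columns, A and B. For more columns, one possible generalization is to keep only the first maximum in each row via NumPy's argmax. This is a sketch, not part of the original answer; the helper name keep_first_row_max is made up, and it assumes numeric columns without NaN:

```python
import numpy as np
import pandas as pd

def keep_first_row_max(df):
    """Keep only the first row-wise maximum; everything else becomes NaN.

    Hypothetical helper; assumes numeric data without NaN values.
    """
    values = df.to_numpy()
    # argmax returns the position of the FIRST maximum in each row,
    # so ties are resolved in favour of the leftmost column.
    first_max = values.argmax(axis=1)
    keep = np.zeros(values.shape, dtype=bool)
    keep[np.arange(len(df)), first_max] = True
    mask = pd.DataFrame(keep, index=df.index, columns=df.columns)
    return df.where(mask)

tmp = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [3, 4, 1, 33, 10, 9, 7, 3, 10, 10],
})
result = keep_first_row_max(tmp)
print(result)
```

On the example data this should give the same frame as the two-column mask above, with exactly one surviving value per row.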

Timings (len(df)=10):

In [234]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)]) 
1000 loops, best of 3: 974 µs per loop 

In [235]: %timeit (gh(tmp)) 
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 1.64 ms per loop 

Timings (len(df)=100k):

In [244]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)]) 
100 loops, best of 3: 7.42 ms per loop 

In [245]: %timeit (gh(t1)) 
1 loop, best of 3: 8.81 s per loop 

Code for the timings:

import pandas as pd 

tmp= pd.DataFrame({ 
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)), 
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10)) 
}) 


tmp = pd.concat([tmp]*10000).reset_index(drop=True) 
t1 = tmp.copy() 

print (tmp[tmp.eq(tmp.max(axis=1), axis=0)]) 


def top(row): 
    data = row.tolist() 
    return [d if d == max(data) else None for d in data] 

def gh(tmp1): 
    return tmp1.apply(top, axis=1) 

print (gh(t1)) 

I had exactly the same thing in mind: 'tmp[tmp.eq(tmp.max(axis=1), axis=0)]' :) –

Thank you both! Much appreciated! :) – Ruslan

2

I would suggest using apply(). You can use it as follows:

In [1]: import pandas as pd 

In [2]: tmp= pd.DataFrame({ 
    ...: 'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)), 
    ...: 'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10)) 
    ...: }) 

In [3]: tmp 
Out[3]: 
    A B 
0 1 3 
1 2 4 
2 3 1 
3 4 33 
4 5 10 
5 6 9 
6 7 7 
7 8 3 
8 9 10 
9 10 10 

In [4]: def top(row): 
    ...:   data = row.tolist() 
    ...:   return [d if d == max(data) else None for d in data] 
    ...: 

In [5]: df2 = tmp.apply(top, axis=1) 

In [6]: df2 
Out[6]: 
    A B 
0 NaN 3 
1 NaN 4 
2 3 NaN 
3 NaN 33 
4 NaN 10 
5 NaN 9 
6 7 7 
7 8 NaN 
8 NaN 10 
9 10 10 
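
One caveat, as far as I know: in newer pandas versions, returning a plain list from the function passed to apply(axis=1) may give a Series of lists rather than a DataFrame like the one shown above. A minimal sketch of a variant that returns a Series from top instead, which should keep the original column labels (this is an adaptation, not the answer's original code):

```python
import pandas as pd

def top(row):
    # Series.where keeps entries equal to the row maximum and
    # replaces all others with NaN; the A/B labels are preserved.
    return row.where(row == row.max())

tmp = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [3, 4, 1, 33, 10, 9, 7, 3, 10, 10],
})

# Each row comes back as a Series, so apply assembles a DataFrame.
df2 = tmp.apply(top, axis=1)
print(df2)
```

As in the original session, ties (rows 6 and 9) keep both values.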

Thanks! This is *exactly* what I was looking for. *And* it works on any number of columns :) – Ruslan

I think using 'apply' is slower than a vectorized solution; see the timings in my answer. – jezrael