Python: Replace DataFrame values row by row

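I want to conditionally replace values in a pandas DataFrame row by row, so that max(row) is kept and every other value in the row is set to None. My instinct is to reach for apply(), but I'm not sure whether that is the right choice, or how to go about it.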

Example (there may be more columns, though):

tmp= pd.DataFrame({ 
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)), 
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10)) 
}) 

tmp 
    A B 
0 1 3 
1 2 4 
2 3 1 
3 4 33 
4 5 10 
5 6 9 
6 7 7 
7 8 3 
8 9 10 
9 10 10

Desired output:

somemagic(tmp) 
    A  B 
0 None 3 
1 None 4 
2 3  None 
3 None 33 
4 None 10 
5 None 9 
6 7  None # on tie I don't really care which one is set to None 
7 8  None 
8 None 10 
9 10  None # on tie I don't really care which one is set to None

Any suggestions on how to achieve this?

Source

2016-05-30 Ruslan

You can compare the DataFrame with the row-wise max values using eq:

print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])

mask = (tmp.eq(tmp.max(axis=1), axis=0)) 
print (mask) 
     A  B 
0 False True 
1 False True 
2 True False 
3 False True 
4 False True 
5 False True 
6 True True 
7 True False 
8 False True 
9 True True 

df = (tmp[mask]) 
print (df) 
     A  B 
0 NaN 3.0 
1 NaN 4.0 
2 3.0 NaN 
3 NaN 33.0 
4 NaN 10.0 
5 NaN 9.0 
6 7.0 7.0 
7 8.0 NaN 
8 NaN 10.0 
9 10.0 10.0

To also set NaN where the values in the two columns are equal (ties), adjust the mask:

mask = (tmp.eq(tmp.max(axis=1), axis=0)) 
mask['B'] = mask.B & (tmp.A != tmp.B) 
print (mask) 
     A  B 
0 False True 
1 False True 
2 True False 
3 False True 
4 False True 
5 False True 
6 True False 
7 True False 
8 False True 
9 True False 

df = (tmp[mask]) 
print (df) 
     A  B 
0 NaN 3.0 
1 NaN 4.0 
2 3.0 NaN 
3 NaN 33.0 
4 NaN 10.0 
5 NaN 9.0 
6 7.0 NaN 
7 8.0 NaN 
8 NaN 10.0 
9 10.0 NaN
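The tie handling above refers to columns A and B by name, so it only covers the two-column case. For an arbitrary number of columns, a rough sketch is to keep just the first maximum in each row via NumPy's `argmax`. This is an alternative to the answer's approach, not part of it, and the names `first_max` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

tmp = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [3, 4, 1, 33, 10, 9, 7, 3, 10, 10],
})

values = tmp.to_numpy()

# Position of the first maximum in each row (argmax returns the first on ties)
first_max = values.argmax(axis=1)

# Boolean mask with exactly one True per row, at that position
mask = np.zeros(values.shape, dtype=bool)
mask[np.arange(len(tmp)), first_max] = True

# Keep the chosen maximum, set everything else to NaN
result = tmp.where(mask)
print(result)
```

On a tie the leftmost column wins, which matches the output shown above for rows 6 and 9.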

Timings (len(df)=10):

In [234]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)]) 
1000 loops, best of 3: 974 µs per loop 

In [235]: %timeit (gh(tmp)) 
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 1.64 ms per loop

Timings (len(df)=100k):

In [244]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)]) 
100 loops, best of 3: 7.42 ms per loop 

In [245]: %timeit (gh(t1)) 
1 loop, best of 3: 8.81 s per loop

Code used for the timings:

import pandas as pd 

tmp= pd.DataFrame({ 
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)), 
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10)) 
}) 


tmp = pd.concat([tmp]*10000).reset_index(drop=True) 
t1 = tmp.copy() 

print (tmp[tmp.eq(tmp.max(axis=1), axis=0)]) 


def top(row): 
    data = row.tolist() 
    return [d if d == max(data) else None for d in data] 

def gh(tmp1): 
    return tmp1.apply(top, axis=1) 

print (gh(t1))

Source

2016-05-30 11:39:13 jezrael

I had exactly the same thing in mind: `tmp[tmp.eq(tmp.max(axis=1), axis=0)]` :) –

Thank you, guys! Much appreciated! :) – Ruslan

I would suggest using apply(). You can use it as follows:

In [1]: import pandas as pd 

In [2]: tmp= pd.DataFrame({ 
    ...: 'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)), 
    ...: 'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10)) 
    ...: }) 

In [3]: tmp 
Out[3]: 
    A B 
0 1 3 
1 2 4 
2 3 1 
3 4 33 
4 5 10 
5 6 9 
6 7 7 
7 8 3 
8 9 10 
9 10 10 

In [4]: def top(row): 
    ...:   data = row.tolist() 
    ...:   return [d if d == max(data) else None for d in data] 
    ...: 

In [5]: df2 = tmp.apply(top, axis=1) 

In [6]: df2 
Out[6]: 
    A B 
0 NaN 3 
1 NaN 4 
2 3 NaN 
3 NaN 33 
4 NaN 10 
5 NaN 9 
6 7 7 
7 8 NaN 
8 NaN 10 
9 10 10
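Depending on the pandas version, a function that returns a plain list under apply(axis=1) may yield a Series of lists rather than a DataFrame like the one shown above. A hedged variant that returns a Series from each call, so the column labels are preserved, might look like this; `top_series` is a hypothetical name and the code is a sketch rather than the answer's original function.

```python
import pandas as pd

tmp = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [3, 4, 1, 33, 10, 9, 7, 3, 10, 10],
})

def top_series(row):
    # Keep values equal to the row maximum; everything else becomes NaN.
    # Returning a Series keeps the column labels in the result.
    return row.where(row == row.max())

df2 = tmp.apply(top_series, axis=1)
print(df2)
```

As in the output above, ties are kept in both columns, and this still calls Python code once per row.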

Source

2016-05-30 11:51:36

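Thanks! This is *exactly* what I was looking for. *And* it works on any number of columns :) – Ruslan

I think using `apply` is slower than the vectorized solution; see the timings in my answer. – jezrael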
