How can I avoid a slow for loop when working with a Pandas DataFrame?

for i in range(1, len(df)):
    if df.loc[i]["identification"] == df.loc[i-1]["identification"] and df.loc[i]["date"] == df.loc[i-1]["date"]:
        df.loc[i, "duplicate"] = 1
    else:
        df.loc[i, "duplicate"] = 0

This kind of loop runs very slowly on large DataFrames.

Any suggestions?

2016-11-15 Gursel Karacor

Please provide more details: what counts as "slow", and what counts as "large"? – Danra

Try a vectorized approach instead of a loop:

import numpy as np

# 1 where both columns match the previous row, otherwise 0
df['duplicate'] = np.where(
    (df.identification == df.identification.shift())
    & (df.date == df.date.shift()),
    1, 0)

2016-11-15 20:54:02 MaxU

Great, this is exactly what I was looking for. The running time improved enormously, thanks. –
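
For anyone who wants to see what the shift-based comparison does, here is a minimal sketch on a tiny DataFrame. The sample values are made up for illustration, and the frame is assumed to be sorted already with a default 0..n-1 index, as in the question. `shift()` moves each column down by one row, so comparing a column with its shifted copy is the same as comparing each row with the previous one.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the question), already sorted
df = pd.DataFrame({
    "identification": [1, 1, 1, 2, 2],
    "date": ["2016-01-01", "2016-01-01", "2016-01-02",
             "2016-01-02", "2016-01-02"],
})

# Compare every row with the row above it, one whole column at a time
same_id = df["identification"] == df["identification"].shift()
same_date = df["date"] == df["date"].shift()

# 1 where both columns match the previous row, otherwise 0
df["duplicate"] = np.where(same_id & same_date, 1, 0)

print(df)
```

One small difference from the loop: the first row has no previous row, so the shifted value is NaN and the comparison is False, which gives 0 here. The loop in the question never assigns row 0 at all.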

It looks like you are only checking whether the values are duplicated. In that case, you can use

df.sort_values(by=['identification', 'date'], inplace=True) 
df['duplicate'] = df.duplicated(subset=['identification', 'date']).astype(int)

2016-11-15 20:56:55 kgully

The sorting was already done, but your suggestion works fine too, thanks. –
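
As a rough check that the two suggestions agree, here is a small sketch with made-up data. It assumes the default `keep='first'` behaviour of `duplicated()`, which flags every occurrence of a key pair after the first one. Once the rows are sorted by the same two columns, identical pairs sit next to each other, so this should give the same flags as comparing each row with the previous one.

```python
import pandas as pd

# Hypothetical unsorted sample data (not from the question)
df = pd.DataFrame({
    "identification": [2, 1, 1, 2, 1],
    "date": ["2016-01-02", "2016-01-01", "2016-01-02",
             "2016-01-02", "2016-01-01"],
})

# Sort so identical (identification, date) pairs become adjacent
df = df.sort_values(by=["identification", "date"]).reset_index(drop=True)

# Flag every repeat of a pair after its first occurrence
df["duplicate"] = df.duplicated(subset=["identification", "date"]).astype(int)

# Shift-based comparison with the previous row, for cross-checking
same_as_prev = (
    (df["identification"] == df["identification"].shift())
    & (df["date"] == df["date"].shift())
).astype(int)

print(df.assign(same_as_prev=same_as_prev))
```

Without the sort the two approaches can differ: `duplicated()` also flags repeats that are not adjacent, while the shift comparison only looks at the row directly above.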
