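Ignore duplicated max values in GroupBy - pandas

I read this thread about grouping and getting the max: Apply vs transform on a group object.

It works perfectly if the max is unique within a group, but I have run into a problem: I want to ignore values that are duplicated within a group, take the max of the remaining unique values, and then put the result back into the DataSeries.

Input (named df1):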

date  val 
2004-01-01 0 
2004-02-01 0 
2004-03-01 0 
2004-04-01 0 
2004-05-01 0 
2004-06-01 0 
2004-07-01 0 
2004-08-01 0 
2004-09-01 0 
2004-10-01 0 
2004-11-01 0 
2004-12-01 0 
2005-01-01 11 
2005-02-01 11 
2005-03-01 8 
2005-04-01 5 
2005-05-01 0 
2005-06-01 0 
2005-07-01 2 
2005-08-01 1 
2005-09-01 0 
2005-10-01 0 
2005-11-01 3 
2005-12-01 3

My code:

df1['peak_month'] = df1.groupby(df1.date.dt.year)['val'].transform('max') == df1['val']

My output:

date  val max 
2004-01-01 0  true #notice how all duplicates are true in 2004 
2004-02-01 0  true 
2004-03-01 0  true 
2004-04-01 0  true 
2004-05-01 0  true 
2004-06-01 0  true 
2004-07-01 0  true 
2004-08-01 0  true 
2004-09-01 0  true 
2004-10-01 0  true 
2004-11-01 0  true 
2004-12-01 0  true 
2005-01-01 11 true #notice how these two values 
2005-02-01 11 true #are the max values for 2005 and are true 
2005-03-01 8  false 
2005-04-01 5  false 
2005-05-01 0  false 
2005-06-01 0  false 
2005-07-01 2  false 
2005-08-01 1  false 
2005-09-01 0  false 
2005-10-01 0  false 
2005-11-01 3  false 
2005-12-01 3  false
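
The output above follows from how `transform` works: it broadcasts each group's max back to every row of that group, so every row that ties for the max compares equal. A minimal sketch on four of the 2005 values (the variable names here are illustrative, not from the original post):

```python
import pandas as pd

# Four values from 2005; 11 appears twice.
vals = pd.Series([11, 11, 8, 5], name="val")
year = pd.Series([2005, 2005, 2005, 2005], name="year")

# transform returns one value per row: the max of that row's group.
group_max = vals.groupby(year).transform("max")
print(group_max.tolist())            # [11, 11, 11, 11]

# Both tied rows equal the group max, so both are flagged.
print((vals == group_max).tolist())  # [True, True, False, False]
```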

Expected output:

date  val max 
2004-01-01 0  false #notice how all duplicates are false in 2004 
2004-02-01 0  false #because they are the same and all vals are max 
2004-03-01 0  false 
2004-04-01 0  false 
2004-05-01 0  false 
2004-06-01 0  false 
2004-07-01 0  false 
2004-08-01 0  false 
2004-09-01 0  false 
2004-10-01 0  false 
2004-11-01 0  false 
2004-12-01 0  false 
2005-01-01 11 false #notice how these two values 
2005-02-01 11 false #are the max values for 2005 but are false 
2005-03-01 8  true #this is the second max val and is true 
2005-04-01 5  false 
2005-05-01 0  false 
2005-06-01 0  false 
2005-07-01 2  false 
2005-08-01 1  false 
2005-09-01 0  false 
2005-10-01 0  false 
2005-11-01 3  false 
2005-12-01 3  false

For reference:

import pandas as pd
df1 = pd.DataFrame({'val': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 11, 8, 5, 0, 0, 2, 1, 0, 0, 3, 3],
'date': ['2004-01-01','2004-02-01','2004-03-01','2004-04-01','2004-05-01','2004-06-01','2004-07-01','2004-08-01','2004-09-01','2004-10-01','2004-11-01','2004-12-01','2005-01-01','2005-02-01','2005-03-01','2005-04-01','2005-05-01','2005-06-01','2005-07-01','2005-08-01','2005-09-01','2005-10-01','2005-11-01','2005-12-01']})
df1['date'] = pd.to_datetime(df1['date'])  # the .dt accessor used above requires datetime values

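Source

2016-03-08 ethanenglish

This question is unclear, and you have too much data to illustrate your point. I don't know why you would want to ignore duplicates. The max of [5, 5, 2, 2] is the same as the max of [5, 2]. – Alexander

I need the max value for a year, or, if the values are the same, none. – ethanenglish

Not a slick solution, but it works. The idea is to first determine the values that appear only once in each year, and then apply the max transform to those unique values.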

# Determine the values that appear exactly once in each year.
df1['year'] = df1.date.dt.year
unique_vals = df1.drop_duplicates(subset=['year', 'val'], keep=False)

# Max transform on the unique values.
is_peak = unique_vals.groupby('year')['val'].transform('max') == unique_vals['val']

# Rows dropped as duplicates are missing from is_peak: mark them False, then drop the extra column.
df1['peak_month'] = is_peak.reindex(df1.index, fill_value=False)
df1 = df1.drop(columns='year')
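
The same idea can be sketched without a temporary column, by masking the repeated values instead of dropping rows. This is only an alternative sketch, not the answer's code: the helper name `flag_unique_peak` and its parameters are hypothetical, and the sample frame is rebuilt with `pd.date_range` purely for brevity.

```python
import pandas as pd

def flag_unique_peak(df, date_col="date", val_col="val"):
    """Hypothetical helper: True where val is the largest value that
    occurs exactly once within its calendar year."""
    year = pd.to_datetime(df[date_col]).dt.year
    keys = pd.DataFrame({"year": year, "val": df[val_col]})
    # Mark every row whose (year, val) pair occurs more than once.
    repeated = keys.duplicated(keep=False)
    # Blank out repeated values so they cannot win the max.
    candidates = df[val_col].where(~repeated)
    # Per-year max of the remaining values (NaN if none remain).
    peak = candidates.groupby(year).transform("max")
    # NaN never compares equal, so repeated rows come out False.
    return candidates.eq(peak)

df1 = pd.DataFrame({
    "date": pd.date_range("2004-01-01", periods=24, freq="MS"),
    "val": [0] * 12 + [11, 11, 8, 5, 0, 0, 2, 1, 0, 0, 3, 3],
})
df1["peak_month"] = flag_unique_peak(df1)
print(df1.loc[df1["peak_month"], "date"].dt.strftime("%Y-%m-%d").tolist())
# ['2005-03-01']
```

Masking keeps the result aligned with the original index, so there is nothing to fill in afterwards.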

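Source

2016-03-08 17:51:15 root

No, the `keep=False` keyword argument forces `drop_duplicates` to drop every copy of a duplicated row. Without that keyword argument your concern would be valid, because `drop_duplicates` keeps the first of the duplicated records by default. My code produces the expected output. – root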
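
The difference between the default and `keep=False` can be checked on a small frame. This is an illustrative sketch; the frame and variable names are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"year": [2005, 2005, 2005, 2005],
                   "val": [11, 11, 8, 5]})

# Default keep='first': one copy of the duplicated 11 survives.
keep_first = df.drop_duplicates(subset=["year", "val"])
print(keep_first["val"].tolist())  # [11, 8, 5]

# keep=False: every copy of the duplicated 11 is dropped.
keep_none = df.drop_duplicates(subset=["year", "val"], keep=False)
print(keep_none["val"].tolist())   # [8, 5]
```

With the default, 11 would still be the group max and the tied rows would still be flagged.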

@Parfait This works like a charm. Thanks for looking it over and walking through the logic! – ethanenglish
