Filter the entries of a pandas DataFrame based on the presence (or absence) of entries earlier than a given date

I have a DataFrame containing test runs, their dates, and their results. It looks like this:

TestName;Date;IsPassed 
test1;1/31/2017 9:44:30 PM;0 
test1;1/31/2017 9:39:00 PM;0 
test1;1/31/2017 9:38:29 PM;1 
test1;1/31/2017 9:38:27 PM;1 
test2;10/31/2016 5:05:02 AM;0 
test3;12/7/2016 8:58:36 PM;0 
test3;12/7/2016 8:57:19 PM;0 
test3;12/7/2016 8:56:15 PM;0 
test4;12/5/2016 6:50:49 PM;0 
test4;12/5/2016 6:49:50 PM;0 
test4;12/5/2016 3:23:09 AM;1 
test4;12/4/2016 11:51:29 PM;1
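
To make the sample reproducible, here is a minimal sketch of loading it. The name `test_runs` is borrowed from the update further down; the inline string stands in for whatever file the data really lives in, and the month/day/year 12-hour date format is an assumption read off the sample rows.

```python
import io

import pandas as pd

# The sample rows from the question, kept inline so the sketch is self-contained.
RAW = """TestName;Date;IsPassed
test1;1/31/2017 9:44:30 PM;0
test1;1/31/2017 9:39:00 PM;0
test1;1/31/2017 9:38:29 PM;1
test1;1/31/2017 9:38:27 PM;1
test2;10/31/2016 5:05:02 AM;0
test3;12/7/2016 8:58:36 PM;0
test3;12/7/2016 8:57:19 PM;0
test3;12/7/2016 8:56:15 PM;0
test4;12/5/2016 6:50:49 PM;0
test4;12/5/2016 6:49:50 PM;0
test4;12/5/2016 3:23:09 AM;1
test4;12/4/2016 11:51:29 PM;1
"""

test_runs = pd.read_csv(io.StringIO(RAW), sep=";")

# Assumed format: month/day/year with a 12-hour clock and AM/PM marker.
test_runs["Date"] = pd.to_datetime(test_runs["Date"], format="%m/%d/%Y %I:%M:%S %p")
```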

I want to be able to find the test names that have no runs before (or after) a specified date.

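I could of course do it like this:

Identify all unique test names
For each of them, find its minimum and maximum dates
Based on that, add the matching rows to a new DataFrame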
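
For reference, the loop-based steps above might look roughly like the sketch below. It uses a small subset of the sample data, and the `cutoff` value and the "no runs on or after the cutoff" rule are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Small illustrative frame (a subset of the sample data).
test_runs = pd.DataFrame({
    "TestName": ["test1", "test1", "test2", "test4", "test4"],
    "Date": pd.to_datetime([
        "2017-01-31 21:44:30", "2017-01-31 21:38:27",
        "2016-10-31 05:05:02",
        "2016-12-05 18:50:49", "2016-12-04 23:51:29",
    ]),
    "IsPassed": [0, 1, 0, 0, 1],
})

cutoff = pd.Timestamp("2017-01-01")  # hypothetical cutoff date

kept = []
for name in test_runs["TestName"].unique():
    runs = test_runs[test_runs["TestName"] == name]
    # Keep the test only if its latest run is before the cutoff,
    # i.e. it has no runs on or after that date.
    if runs["Date"].max() < cutoff:
        kept.append(runs)

# Note: pd.concat raises on an empty list, so this assumes at least one match.
result = pd.concat(kept)
```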

But is there a way to do this with pandas itself, without explicit loops?

Update

Based on the solution by @jezrael: let's say I want to keep only the tests whose runs all happened in 2016. Do I then have to do something like this?

idx = test_runs.groupby('TestName').Date.agg(['idxmax']).stack().unique() 
selected = test_runs.loc[idx].Date < pd.to_datetime('2017-01-01') 
tests = test_runs.loc[idx].loc[selected].TestName 
print(test_runs[test_runs.TestName.isin(tests)])

Output:

TestName    Date IsPassed 
4  test2 2016-10-31 05:05:02   0 
5  test3 2016-12-07 20:58:36   0 
6  test3 2016-12-07 20:57:19   0 
7  test3 2016-12-07 20:56:15   0 
8  test4 2016-12-05 18:50:49   0 
9  test4 2016-12-05 18:49:50   0 
10 test4 2016-12-05 03:23:09   1 
11 test4 2016-12-04 23:51:29   1

Source

2017-02-10 Nick Slavsky

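I think you need groupby with agg, using idxmin and idxmax to return the index values of each group's minimum and maximum dates, and then stack to reshape the result into a Series. It is also necessary to remove duplicates with unique, because in single-row groups such as test2 both functions return the same index.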

Finally, select all of those rows with loc:

import numpy as np
import pandas as pd

df.Date = pd.to_datetime(df.Date) 
idx = df.groupby('TestName').Date.agg(['idxmin','idxmax']).stack().unique() 
print (idx) 
[ 3 0 4 7 5 11 8] 

selected = df.loc[idx] 
print (selected) 
    TestName    Date IsPassed 
3  test1 2017-01-31 21:38:27   1 
0  test1 2017-01-31 21:44:30   0 
4  test2 2016-10-31 05:05:02   0 
7  test3 2016-12-07 20:56:15   0 
5  test3 2016-12-07 20:58:36   0 
11 test4 2016-12-04 23:51:29   1 
8  test4 2016-12-05 18:50:49   0

If you need the index sorted, add numpy.sort, because unique returns a NumPy array.

print (df.loc[np.sort(idx)]) 
    TestName    Date IsPassed 
0  test1 2017-01-31 21:44:30   0 
3  test1 2017-01-31 21:38:27   1 
4  test2 2016-10-31 05:05:02   0 
5  test3 2016-12-07 20:58:36   0 
7  test3 2016-12-07 20:56:15   0 
8  test4 2016-12-05 18:50:49   0 
11 test4 2016-12-04 23:51:29   1

Edit:

Your code looks fine; I have only added a few improvements:

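# note: with 'idxmin' included, a test is kept if its first OR last run is before the cutoff;
# use only 'idxmax' to require that every run is before it
idx = test_runs.groupby('TestName').Date.agg(['idxmin','idxmax']).stack().unique() 
# store the selection in a variable so that loc is not applied twice
df1 = test_runs.loc[idx] 
# casting the string to datetime is not necessary for the comparison
selected = df1['Date'] < '2017-01-01' 
# select rows and a column in one step with df.loc[row_mask, column_name]
tests = df1.loc[selected, 'TestName'] 
# unique was added for better performance on a large DataFrame
print(test_runs[test_runs.TestName.isin(tests.unique())]) 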
    TestName    Date IsPassed 
4  test2 2016-10-31 05:05:02   0 
5  test3 2016-12-07 20:58:36   0 
6  test3 2016-12-07 20:57:19   0 
7  test3 2016-12-07 20:56:15   0 
8  test4 2016-12-05 18:50:49   0 
9  test4 2016-12-05 18:49:50   0 
10 test4 2016-12-05 03:23:09   1 
11 test4 2016-12-04 23:51:29   1
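
As an alternative that is not part of the answer above, `groupby(...).transform` can broadcast each test's latest (or earliest) run date back onto every row, so the original frame can be filtered directly without collecting index values first. The sketch below rebuilds the sample data; the names `last_run`, `first_run`, `only_before` and `only_after` are hypothetical, and the two cutoff dates are the ones mentioned in the question and comments.

```python
import pandas as pd

# The sample data from the question, with dates already in ISO form.
test_runs = pd.DataFrame({
    "TestName": ["test1"] * 4 + ["test2"] + ["test3"] * 3 + ["test4"] * 4,
    "Date": pd.to_datetime([
        "2017-01-31 21:44:30", "2017-01-31 21:39:00",
        "2017-01-31 21:38:29", "2017-01-31 21:38:27",
        "2016-10-31 05:05:02",
        "2016-12-07 20:58:36", "2016-12-07 20:57:19", "2016-12-07 20:56:15",
        "2016-12-05 18:50:49", "2016-12-05 18:49:50",
        "2016-12-05 03:23:09", "2016-12-04 23:51:29",
    ]),
    "IsPassed": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Latest run date of each test, repeated on every row of that test.
last_run = test_runs.groupby("TestName")["Date"].transform("max")
# Tests with no run on or after 2017-01-01, keeping all of their rows.
only_before = test_runs[last_run < pd.Timestamp("2017-01-01")]

# Earliest run date of each test, repeated on every row of that test.
first_run = test_runs.groupby("TestName")["Date"].transform("min")
# Tests whose runs all happened after 2016-11-01, keeping all of their rows.
only_after = test_runs[first_run > pd.Timestamp("2016-11-01")]
```

Because the mask has the same index as `test_runs`, every row of a qualifying test is kept, which is what the `isin` step achieves in the code above.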

Source

2017-02-10 19:39:37 jezrael

Thanks! This is very close to what I need. Suppose my task is to keep only the tests whose runs happened after November 1, 2016 (that is, leaving out test2). Do I need to use something like this: `idx = test_runs.groupby('TestName').Date.agg(['idxmax']).stack().unique() selected = test_runs.loc[idx].Date > pd.to_datetime('2016-11-01') tests = test_runs.loc[idx].loc[selected].TestName print(test_runs[test_runs.TestName.isin(tests)])` –

I am not sure I understand, but it seems you can simply use [boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing) on the output. Please check the last edit to the answer. – jezrael

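Hmm, this is a bit tricky. After getting `selected`, I need to go back to the _original_ DataFrame and keep only the selected test names, but keep all of their dates. I will update the question in a minute –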
