2017-06-12 97 views

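Remove subseries (rows in a data frame) that meet a condition

I have a data frame with a time series (column 1), its values (column 2), and a column holding a feature of each subseries of the time series. How can I remove the subseries that meet a condition?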

The picture illustrates what I want to do: I want to remove the orange rows.

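I tried a loop that creates an additional column marking the rows to remove, but this solution is computationally very expensive (my column has 10 million records). Code (slow solution):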

import numpy as np 
import pandas as pd 

# sample data (smaller than actual df) 
# length of df = 100; should be 10000000 in the actual data frame 
time_ser = 100*[25] 
max_num = 20 
distance = np.random.uniform(0,max_num,100) 
to_remove= 100*[np.nan] 

data_dict = {'time_ser':time_ser, 
      'distance':distance, 
      'to_remove': to_remove 
      } 

df = pd.DataFrame(data_dict) 

subser_size = 3 
maxdist = 18 


# loop which creates an additional column which indicates which indexes should be removed. 
# Takes first value in a subseries and checks if it meets the condition. 
# If it does, all values in subseries (i.e. rows) should be removed ('wrong'). 

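# cast to object so the float NaN column can hold strings
df['to_remove'] = df['to_remove'].astype(object)
col = df.columns.get_loc('to_remove')
for i, d in enumerate(df['distance']):
    if d >= maxdist:
        df.iloc[i:i+subser_size, col] = 'wrong'
    elif df.iloc[i, col] != 'wrong':
        # do not overwrite a row already flagged by an earlier subseries
        df.iloc[i, col] = 'good'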

Answer

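You can use a list comprehension to build an array of indexes for each subseries, join them with numpy.concatenate, and remove the duplicates with numpy.unique.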

import numpy as np
import pandas as pd

np.random.seed(123)
time_ser = 100*[25] 
max_num = 20 
distance = np.random.uniform(0,max_num,100) 
to_remove= 100*[np.nan] 

data_dict = {'time_ser':time_ser, 
      'distance':distance, 
      'to_remove': to_remove 
      } 

df = pd.DataFrame(data_dict) 
print (df) 
     distance  time_ser  to_remove
0   13.929384        25        NaN
1    5.722787        25        NaN
2    4.537029        25        NaN
3   11.026295        25        NaN
4   14.389379        25        NaN
5    8.462129        25        NaN
6   19.615284        25        NaN
7   13.696595        25        NaN
8    9.618638        25        NaN
9    7.842350        25        NaN
10   6.863560        25        NaN
11  14.580994        25        NaN

subser_size = 3 
maxdist = 18 

print (df.index[df['distance'] >= maxdist]) 
Int64Index([6, 38, 47, 84, 91], dtype='int64') 

arr = [np.arange(i, min(i+subser_size,len(df))) for i in df.index[df['distance'] >= maxdist]] 
idx = np.unique(np.concatenate(arr)) 
print (idx) 
[ 6 7 8 38 39 40 47 48 49 84 85 86 91 92 93] 

# keep the original df intact so idx can still be used on it later
df1 = df.drop(idx)
print (df1)
     distance  time_ser  to_remove
0   13.929384        25        NaN
1    5.722787        25        NaN
2    4.537029        25        NaN
3   11.026295        25        NaN
4   14.389379        25        NaN
5    8.462129        25        NaN
9    7.842350        25        NaN
10   6.863560        25        NaN
11  14.580994        25        NaN
... 
... 
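
With 10 million rows and many matches, the list comprehension above creates one small array per match. As a possible alternative (a minimal sketch, not part of the original answer), the same positions can be built in one step with NumPy broadcasting. The names starts, pos and df_kept are made up for this illustration, and the sample data is rebuilt so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'time_ser': 100 * [25],
                   'distance': np.random.uniform(0, 20, 100)})

subser_size = 3
maxdist = 18

# positions of the rows that start a subseries to remove
starts = np.flatnonzero(df['distance'].to_numpy() >= maxdist)

# add the offsets 0..subser_size-1 to every start via broadcasting
pos = (starts[:, None] + np.arange(subser_size)).ravel()

# discard positions past the end of the frame and remove duplicates
pos = np.unique(pos[pos < len(df)])

# drop by position, so this does not rely on a default integer index
df_kept = df.drop(df.index[pos])
```

Because the rows are selected by position and only then translated to labels through df.index[pos], this variant should also work when the index is not the default 0..n-1 range, which the arange-based version assumes.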

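To flag the rows in a new column instead of dropping them, use loc on the original df (before drop is applied, so that the labels in idx still exist):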

df['to_remove'] = 'good' 
df.loc[idx, 'to_remove'] = 'wrong' 
print (df) 
     distance  time_ser  to_remove
0   13.929384        25       good
1    5.722787        25       good
2    4.537029        25       good
3   11.026295        25       good
4   14.389379        25       good
5    8.462129        25       good
6   19.615284        25      wrong
7   13.696595        25      wrong
8    9.618638        25      wrong
9    7.842350        25       good
10   6.863560        25       good
11  14.580994        25       good
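
The flag column can also be computed without building any index array. The following is a hedged sketch, not taken from the original answer: it treats every row with distance >= maxdist as the start of a subseries and uses a rolling maximum to spread that flag over the next subser_size - 1 rows. The names hit, mask and df_kept are assumptions of this example:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'time_ser': 100 * [25],
                   'distance': np.random.uniform(0, 20, 100)})

subser_size = 3
maxdist = 18

# 1 where a row starts a subseries to remove, 0 elsewhere
hit = (df['distance'] >= maxdist).astype(int)

# a row is flagged if it, or any of the previous subser_size - 1 rows, is a hit
mask = hit.rolling(subser_size, min_periods=1).max().astype(bool)

# either label the rows ...
df['to_remove'] = np.where(mask, 'wrong', 'good')

# ... or filter them out directly
df_kept = df[~mask]
```

The rolling window looks backwards, which is equivalent to marking subser_size rows forward from each hit, and min_periods=1 keeps the first rows from becoming NaN.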

Thanks for accepting. You can also upvote: click the small triangle above the '0', above the accept mark. Thanks. – jezrael
