2017-03-24 40 views
0

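Finding adjacent regions in a pandas Series

I want to select all regions with values greater than 1, provided they are connected to an element with a value above 5. Two values are not connected if they are separated by a 0. For the following dataset,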

pd.Series(data = [0,2,0,2,3,6,3,0]) 

the output should be

pd.Series(data = [False,False,False,True,True,True,True,False]) 
+1

The second 2 is not adjacent to a value above 5. Can you clarify the definition? –

+0

Does this clarify it? –

+1

Strictly greater than 1, or >= 1? – FLab

Answers

1

Well, it looks like I have found a one-liner that uses the pandas groupby functionality:

import pandas as pd 

ts = pd.Series(data = [0,2,0,2,3,6,3,0]) 

# The flag column allows me to identify sequences. Here 0s are included 
# in the "sequence", but as you can see in next line doesn't matter 
df = pd.concat([ts, (ts==0).cumsum()], axis = 1, keys = ['val', 'flag']) 

# val flag 
#0 0  1 
#1 2  1 
#2 0  2 
#3 2  2 
#4 3  2 
#5 6  2 
#6 3  2 
#7 0  3 

# For each group (having the same flag), I do a boolean AND of two conditions: 
# any value above 5 AND value above 1 (which excludes zeros) 
df.groupby('flag').transform(lambda x: (x>5).any() * x > 1) 

#Out[32]: 
#  val 
#0 False 
#1 False 
#2 False 
#3 True 
#4 True 
#5 True 
#6 True 
#7 False 

In case you are wondering, you can collapse everything into one line:

ts.groupby((ts==0).cumsum()).transform(lambda x: (x>5).any() * x > 1).astype(bool) 
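The one-liner relies on `*` binding tighter than `>`, so it reads as `((x>5).any() * x) > 1`. As a minimal sketch of the same idea with the two conditions spelled out, the version below uses `&` and a per-group maximum instead. The function name `select_connected` and its `low`/`high` parameters are hypothetical names introduced here for illustration; they are not part of the answer above.

```python
import pandas as pd


def select_connected(ts, low=1, high=5):
    """Mark values > low that sit in a zero-delimited run containing a value > high.

    Hypothetical helper for illustration; assumes 0 is the only separator.
    """
    # Each 0 starts a new group, so all values up to the next 0 share a label.
    group_id = (ts == 0).cumsum()
    # A group contains a value above `high` exactly when its maximum does.
    has_high = ts.groupby(group_id).transform("max").gt(high)
    return ts.gt(low) & has_high


ts = pd.Series(data=[0, 2, 0, 2, 3, 6, 3, 0])
result = select_connected(ts)
print(result.tolist())
# [False, False, False, True, True, True, True, False]
```

Because the zeros themselves fall into the groups they open, the `ts.gt(low)` condition is what keeps them out of the result.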

I am still leaving my first approach here for reference:

import itertools 
import pandas as pd 

def flatten(l): 
    # Util function to flatten a list of lists 
    # e.g. [[1], [2,3]] -> [1,2,3] 
    return list(itertools.chain(*l)) 

ts = pd.Series(data = [0,2,0,2,3,6,3,0]) 
#Get data as list 
values = ts.values.tolist() 

# From what I understand, the 0s delimit subsequences (so numbers are not 
# connected if separated by a 0) 

# Get location of zeros 
gap_loc = [idx for (idx, el) in enumerate(values) if el==0] 
# Re-create pandas series 
gap_series = pd.Series(False, index = gap_loc) 

# Get values and locations of the subsequences (i.e. separated by zeros) 
valid_loc = [range(prev_gap+1,gap) for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])] 
list_seq = [values[prev_gap+1:gap] for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])] 
# list_seq = [[2], [2, 3, 6, 3]] 

# Verify your condition 
check_condition = [[el>1 and any(map(lambda x: x>5, sublist)) for el in sublist] 
        for sublist in list_seq] 
# Put results back into a pandas Series 
valid_series = pd.Series(flatten(check_condition), index = flatten(valid_loc)) 

# Put everything together: 
result = pd.concat([gap_series, valid_series], axis = 0).sort_index() 

#result 
#Out[101]: 
#0 False 
#1 False 
#2 False 
#3  True 
#4  True 
#5  True 
#6  True 
#7 False 
#dtype: bool 
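Note that this first approach only covers positions between the first and the last zero, which is fine for the example because the Series starts and ends with 0. As a rough sketch of the same list-based idea without that assumption, `itertools.groupby` can split the values into runs of zeros and non-zeros. The helper name `select_connected_runs` and its `low`/`high` parameters are hypothetical and only used for illustration.

```python
from itertools import groupby

import pandas as pd


def select_connected_runs(values, low=1, high=5):
    """Hypothetical helper: apply the rule run by run on a plain list."""
    flags = []
    # groupby yields alternating runs of zeros and non-zeros, in order.
    for is_zero, run in groupby(values, key=lambda v: v == 0):
        run = list(run)
        if is_zero:
            # Separators are never selected.
            flags.extend([False] * len(run))
        else:
            # The run qualifies only if it contains a value above `high`.
            keep = any(v > high for v in run)
            flags.extend([keep and v > low for v in run])
    return flags


ts = pd.Series(data=[0, 2, 0, 2, 3, 6, 3, 0])
result = pd.Series(select_connected_runs(ts.tolist()), index=ts.index)
print(result.tolist())
# [False, False, False, True, True, True, True, False]
```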
+0

You may want to check the new one-liner solution – FLab

0

I solved it myself in an ugly way, see below. However, I would still like to know whether there is a better way to do this.

import pandas as pd 

test_series = pd.Series(data = [0,2,0,2,3,6,3,0]) 

# Make a boolean DataFrame. 
# Column 0 marks values with 1 < value <= 5, and column 1 marks values above 5. 
# The boolean Series we are looking for is column 1 after the loop below has modified it. 
bool_df = pd.DataFrame(data= [(test_series>1), (test_series>5)]).T 
bool_df.loc[:,0] = (bool_df.loc[:,0])&(~bool_df.loc[:,1]) 

k=0 # k indexes the rows where column 0 is True (1 < value <= 5) 
while k < len(bool_df.loc[bool_df.loc[:,0],0]): 
    i = bool_df.loc[bool_df.loc[:,0],0].index[k] # the bool_df index corresponding to k 
    if i > 0: # avoid negative indices 
        if bool_df.loc[i-1,1]: # check whether the previous entry is already marked True 
            bool_df.loc[i,1] = True 
            k+=1 
        else: 
            j=i 
            while bool_df.loc[j,0]: # find the end of the streak of 1 < value <= 5 
                j+=1 
            # set the whole streak to the value found at its end: True if >5, False if <=1 
            bool_df.loc[i:j,1] = bool_df.loc[j,1] 
            k = sum(bool_df.loc[bool_df.loc[:,0],0].index<j) 
    else: 
        k+=1