2017-09-13
2

Incrementing groups of consecutive positive values in an array

Suppose I have a pandas Series of boolean values like this:

import numpy as np
import pandas as pd

vals = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]).astype(bool)

>>> vals 
0  False 
1  False 
2  False 
3  True 
4  True 
5  True 
6  True 
7  False 
8  False 
9  True 
10  True 
11 False 
12  True 
13  True 
14  True 
dtype: bool 

I would like to turn this boolean Series into a Series in which each group of 1s is enumerated with its own number, like this:

0  0 
1  0 
2  0 
3  1 
4  1 
5  1 
6  1 
7  0 
8  0 
9  2 
10 2 
11 0 
12 3 
13 3 
14 3 

How can I do this efficiently?


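I have been able to do this manually by looping over the Series at the Python level and incrementing a counter, but this is obviously slow. I am looking for a vectorized solution. I saw this answer from unutbu, which deals with splitting on increasing groups in NumPy, and tried to get it to work with some kind of cumsum, but have not succeeded so far.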
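For reference, the slow Python-level loop described above might look roughly like the sketch below. The function name `enumerate_groups_loop` is hypothetical and does not come from the question; it is only meant as a baseline against which a vectorized answer can be checked.

```python
import pandas as pd

def enumerate_groups_loop(vals):
    """Label each run of True values 1, 2, 3, ...; False positions get 0."""
    labels = []
    count = 0      # number of True runs seen so far
    prev = False   # value at the previous position
    for v in vals:
        # A new run starts where the value is True and the previous one was not
        if v and not prev:
            count += 1
        labels.append(count if v else 0)
        prev = v
    return pd.Series(labels, index=vals.index)

vals = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]).astype(bool)
result = enumerate_groups_loop(vals)
```

Here `result` should match the desired output shown in the question.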

Answers

3

You can try this:

vals.astype(int).diff().fillna(vals.iloc[0]).eq(1).cumsum().where(vals, 0) 

#0  0 
#1  0 
#2  0 
#3  1 
#4  1 
#5  1 
#6  1 
#7  0 
#8  0 
#9  2 
#10 2 
#11 0 
#12 3 
#13 3 
#14 3 
#dtype: int64 
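To see why the one-liner works, it can help to split the chain into named steps. The sketch below is only an illustration of the same idea; the intermediate variable names are made up, and `int(vals.iloc[0])` is used in place of `vals.iloc[0]` so that the filled value stays numeric.

```python
import pandas as pd

vals = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]).astype(bool)

as_int = vals.astype(int)

# +1 where a run of True starts, -1 where it ends, NaN at position 0
steps = as_int.diff()

# Position 0 has no predecessor; count it as a start if the first value is True
steps = steps.fillna(int(vals.iloc[0]))

# Boolean mask of run starts
starts = steps.eq(1)

# Running count of starts gives each position the number of its latest run
run_id = starts.cumsum()

# Keep the run number only where the original value is True, else 0
labels = run_id.where(vals, 0)
```

The key step is the `cumsum` over the start mask: every position inherits the number of the most recent run start, and `where` then blanks out the gaps between runs.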

1

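# Mark the first element of each run of nonzero values, then count the runs
m = (vals.diff().ne(0) & vals.ne(0)).cumsum()
# Zero out the positions that do not belong to any run
m[vals.eq(0)] = 0
m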
Out[235]: 
0  0 
1  0 
2  0 
3  1 
4  1 
5  1 
6  1 
7  0 
8  0 
9  2 
10 2 
11 0 
12 3 
13 3 
14 3 
dtype: int32 

Data input:

vals = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]) 
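Note that this answer uses an integer Series as input rather than the boolean one from the question. A minimal sketch of the same approach wrapped in a function, assuming we cast to `int` first so that both boolean and integer input behave the same way, could look like this (the name `label_runs` is hypothetical):

```python
import pandas as pd

def label_runs(vals):
    """Same idea as above, with an explicit cast so bool input also works."""
    v = vals.astype(int)
    # True at index 0 (NaN != 0) and wherever the value differs from the previous one
    changed = v.diff().ne(0)
    # A run start is a change that lands on a nonzero value
    starts = changed & v.ne(0)
    m = starts.cumsum()
    # Positions outside any run get label 0
    m[v.eq(0)] = 0
    return m

bool_vals = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]).astype(bool)
labels = label_runs(bool_vals)
```

Because `diff()` yields NaN at the first position and `NaN != 0` is true, a run that begins at index 0 is still counted as a start.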
3

Here's a NumPy approach:

def island_same_label(vals): 

    # Get array for faster processing with NumPy tools, ufuncs 
    a = vals.values 

    # Initialize output array 
    out = np.zeros(a.size, dtype=int) 

    # Get start indices for each island of 1s. Set those as 1s 
    out[np.flatnonzero(a[1:] > a[:-1])+1] = 1 

    # In case 1st element was True, we would have missed it earlier, so add that 
    out[0] = a[0] 

    # Finally cumsum and mask out non-island regions 
    np.cumsum(out, out=out) 
    return pd.Series(np.where(a, out, 0)) 
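The steps in the function can be traced on a small array. The input below is made up for illustration and deliberately starts with True, so that the `out[0] = a[0]` correction matters.

```python
import numpy as np

a = np.array([1, 1, 0, 1, 0, 0, 1, 1], dtype=bool)
out = np.zeros(a.size, dtype=int)

# Rising edges: positions where False is followed by True
rising = np.flatnonzero(a[1:] > a[:-1]) + 1   # [3, 6]
out[rising] = 1                               # [0, 0, 0, 1, 0, 0, 1, 0]

# The comparison above cannot see a run that starts at index 0
out[0] = a[0]                                 # [1, 0, 0, 1, 0, 0, 1, 0]

# Running count of run starts, computed in place
np.cumsum(out, out=out)                       # [1, 1, 1, 2, 2, 2, 3, 3]

# Keep the count only inside the runs
labels = np.where(a, out, 0)                  # [1, 1, 0, 2, 0, 0, 3, 3]
```

Without the `out[0] = a[0]` line, the first run would receive label 0 and every later run would be numbered one too low.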

Timings, using the sample tiled many times as input:

In [15]: vals=pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]).astype(bool) 

In [16]: vals = pd.Series(np.tile(vals,10000)) 

In [17]: %timeit Psidom_app(vals) # @Psidom's soln 
    ...: %timeit Wen_app(vals) # @Wen's soln 
    ...: %timeit island_same_label(vals) # Proposed in this post 
    ...: 
100 loops, best of 3: 9.53 ms per loop 
100 loops, best of 3: 13.2 ms per loop 
1000 loops, best of 3: 959 µs per loop
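The definitions of `Psidom_app` and `Wen_app` are not shown in the post. The sketch below assumes they simply wrap the two pandas answers, with small adjustments (`int(vals.iloc[0])` in one, an `astype(int)` cast in the other) to keep the intermediate dtypes numeric. It uses the standard `timeit` module instead of IPython's `%timeit`, and a smaller tile count to keep the run short. Absolute timings will differ by machine and library version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

def Psidom_app(vals):
    # Assumed wrapper around the diff/cumsum one-liner
    return (vals.astype(int).diff().fillna(int(vals.iloc[0]))
                .eq(1).cumsum().where(vals, 0))

def Wen_app(vals):
    # Assumed wrapper around the diff().ne(0) answer; cast added for bool input
    v = vals.astype(int)
    m = (v.diff().ne(0) & v.ne(0)).cumsum()
    m[v.eq(0)] = 0
    return m

def island_same_label(vals):
    # NumPy approach from this post
    a = vals.to_numpy()
    out = np.zeros(a.size, dtype=int)
    out[np.flatnonzero(a[1:] > a[:-1]) + 1] = 1
    out[0] = a[0]
    np.cumsum(out, out=out)
    return pd.Series(np.where(a, out, 0))

base = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1], dtype=bool)
vals = pd.Series(np.tile(base, 1000))

funcs = (Psidom_app, Wen_app, island_same_label)

# Check that all three agree before comparing their speed
results = {f.__name__: f(vals).to_numpy() for f in funcs}

for f in funcs:
    # Best of 3 repeats, 10 calls each, reported per call
    best = min(timeit.repeat(lambda: f(vals), number=10, repeat=3)) / 10
    print(f"{f.__name__}: {best * 1e3:.3f} ms per call")
```

Comparing the results first guards against benchmarking a variant that is fast but wrong.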