Finding the starts and stops of blocks of consecutive values in Python/numpy/pandas

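I want to find the start and stop indices of blocks of identical values in a numpy array or, preferably, a pandas DataFrame (blocks along the columns for a 2D array, and along the fastest-varying index for an n-dimensional array). I only look for blocks along a single dimension and do not want to aggregate NaNs across different rows.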

Starting from this question (Find large number of consecutive values fulfilling condition in a numpy array), I wrote the following solution, which finds blocks of np.nan in a 2D array:

import numpy as np 
a = np.array([ 
     [1, np.nan, np.nan, 2], 
     [np.nan, 1, np.nan, 3], 
     [np.nan, np.nan, np.nan, np.nan] 
    ]) 

nan_mask = np.isnan(a) 
start_nans_mask = np.hstack((np.resize(nan_mask[:,0],(a.shape[0],1)), 
          np.logical_and(np.logical_not(nan_mask[:,:-1]), nan_mask[:,1:]) 
          )) 
stop_nans_mask = np.hstack((np.logical_and(nan_mask[:,:-1], np.logical_not(nan_mask[:,1:])), 
          np.resize(nan_mask[:,-1], (a.shape[0],1)) 
          )) 

start_row_idx,start_col_idx = np.where(start_nans_mask) 
stop_row_idx,stop_col_idx = np.where(stop_nans_mask)

This lets me, for example, analyze the distribution of the lengths of patches of missing values before applying pd.fillna.

stop_col_idx - start_col_idx + 1 
array([2, 1, 1, 4], dtype=int64)
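As a small illustration of the "distribution of lengths" mentioned above, the block lengths can be tallied with np.unique. This is only a sketch: the variable name block_lengths is made up here, and the values are copied from the example output.

```python
import numpy as np

# Block lengths copied from the example output above (hypothetical variable name).
block_lengths = np.array([2, 1, 1, 4])

# Count how many NaN blocks exist for each distinct length.
lengths, counts = np.unique(block_lengths, return_counts=True)
for length, count in zip(lengths, counts):
    print(f"length {length}: {count} block(s)")
```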

Another example, with the expected result:

a = np.array([ 
     [1, np.nan, np.nan, 2], 
     [np.nan, 1, np.nan, np.nan], 
     [np.nan, np.nan, np.nan, np.nan] 
    ]) 

array([2, 1, 2, 4], dtype=int64)

rather than

array([2, 1, 6], dtype=int64)

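My questions are the following:

Is there a way to optimize my solution (finding the starts and stops in a single pass over the mask/where operations)?
Is there a more optimized solution in pandas? (i.e. a solution other than just applying mask/where to the values of the DataFrame)
What happens when the underlying array or DataFrame is too big to fit in memory?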

Source

2013-02-25 Guillaume

The following numpy-based implementation works for any dimensionality (ndim = 2 or more):

def get_nans_blocks_length(a): 
    """ 
    Returns 1D length of np.nan s block in sequence depth wise (last axis). 
    """ 
    nan_mask = np.isnan(a) 
    start_nans_mask = np.concatenate((np.resize(nan_mask[...,0],a.shape[:-1]+(1,)), 
           np.logical_and(np.logical_not(nan_mask[...,:-1]), nan_mask[...,1:]) 
           ), axis=a.ndim-1) 
    stop_nans_mask = np.concatenate((np.logical_and(nan_mask[...,:-1], np.logical_not(nan_mask[...,1:])), 
           np.resize(nan_mask[...,-1], a.shape[:-1]+(1,)) 
           ), axis=a.ndim-1) 

    start_idxs = np.where(start_nans_mask) 
    stop_idxs = np.where(stop_nans_mask) 
    return stop_idxs[-1] - start_idxs[-1] + 1

So that:

a = np.array([ 
     [1, np.nan, np.nan, np.nan], 
     [np.nan, 1, np.nan, 2], 
     [np.nan, np.nan, np.nan, np.nan] 
    ]) 
get_nans_blocks_length(a) 
array([3, 1, 1, 4], dtype=int64)

And:

a = np.array([ 
     [[1, np.nan], [np.nan, np.nan]], 
     [[np.nan, 1], [np.nan, 2]], 
     [[np.nan, np.nan], [np.nan, np.nan]] 
    ]) 
get_nans_blocks_length(a) 
array([1, 2, 1, 1, 2, 2], dtype=int64)
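One possible variation, touching on the first question (finding starts and stops from a single mask operation), is to pad the NaN mask with False on both ends of the last axis and take np.diff once: rising edges mark starts and falling edges mark stops. This is only a sketch under that assumption, not a benchmarked replacement; the function name nan_run_bounds is hypothetical, and note that its stop indices are exclusive, unlike the inclusive stops used above.

```python
import numpy as np

def nan_run_bounds(a):
    """Hypothetical helper: bounds of NaN runs along the last axis.

    Returns (starts, stops) as np.where-style index tuples; stops are
    exclusive, i.e. one past the last NaN of each run.
    """
    nan_mask = np.isnan(a)
    # Pad one False on each side of the last axis only, so runs touching
    # an edge still produce both a rising and a falling edge.
    pad_width = [(0, 0)] * (a.ndim - 1) + [(1, 1)]
    padded = np.pad(nan_mask, pad_width, mode="constant", constant_values=False)
    # +1 where a run begins, -1 just after a run ends.
    edges = np.diff(padded.astype(np.int8), axis=-1)
    starts = np.where(edges == 1)
    stops = np.where(edges == -1)
    return starts, stops

a = np.array([
    [1, np.nan, np.nan, 2],
    [np.nan, 1, np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan],
])
starts, stops = nan_run_bounds(a)
# No "+ 1" here because the stop indices are exclusive.
run_lengths = stops[-1] - starts[-1]
print(run_lengths)
```

Because every run lies within a single row, np.where returns the starts and stops in the same row-major order, so the two index arrays line up element by element.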

Source

2013-03-04 11:22:24 Guillaume

Nice little snippet... actually, this shouldn't be too far off from working for ndim = 1 as well. – goofd 2014-06-19 21:25:37

I loaded your np.array into a DataFrame:

In [26]: df 
Out[26]: 
     0    1    2    3
0    1  NaN  NaN    2
1  NaN    1  NaN    2
2  NaN  NaN  NaN  NaN

Then I transposed it and turned it into a Series. I think this is similar to np.hstack:

In [28]: s = df.T.unstack(); s 
Out[28]: 
0  0      1
   1    NaN
   2    NaN
   3      2
1  0    NaN
   1      1
   2    NaN
   3      2
2  0    NaN
   1    NaN
   2    NaN
   3    NaN

This expression creates a Series in which the numbers identify blocks, incrementing by 1 for every non-null value:

In [29]: s.notnull().astype(int).cumsum() 
Out[29]: 
0 0 1 
    1 1 
    2 1 
    3 2 
1 0 2 
    1 3 
    2 3 
    3 4 
2 0 4 
    1 4 
    2 4 
    3 4

This expression creates a Series in which every NaN is 1 and everything else is 0:

In [31]: s.isnull().astype(int) 
Out[31]: 
0 0 0 
    1 1 
    2 1 
    3 0 
1 0 1 
    1 0 
    2 1 
    3 0 
2 0 1 
    1 1 
    2 1 
    3 1

We can combine the two in the following way to get the counts you need:

In [32]: s.isnull().astype(int).groupby(s.notnull().astype(int).cumsum()).sum() 
Out[32]: 
1 2 
2 1 
3 1 
4 4

Source

2013-02-25 19:59:01 Zelazny7

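Wow, this is the kind of pandas magic that always impresses me! However, your implementation treats consecutive NaNs that sit on different columns/rows as belonging to the same "block". I have created a small IPython notebook (http://nbviewer.ipython.org/url/www.guillaumeallain.info/ipython_notebooks/find_nan_blocks.ipynb) to demonstrate the problem. Performance-wise, the numpy implementation is also approx. 3 times faster. – Guillaume 2013-02-26 12:35:47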
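The cross-row merging described in this comment could probably be avoided by grouping on the row label in addition to the running non-null count, so that a block can never span two rows. The following is a minimal sketch of that idea, not the answerer's method; the variable names are made up, and it uses the question's second example, for which the expected result is [2, 1, 2, 4].

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, np.nan, np.nan, 2],
    [np.nan, 1, np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan],
])

# Flatten row by row into a Series indexed by (row, column).
s = df.T.unstack()

# Running count of non-null values: constant within a run of NaNs.
block_id = s.notnull().astype(int).cumsum().to_numpy()
# Row label of each element, used as a second grouping key so that
# NaN runs in different rows are never merged.
row_id = s.index.get_level_values(0).to_numpy()

nan_counts = s.isnull().astype(int).groupby([row_id, block_id]).sum()
# Groups with no NaN at all are not blocks, so drop them.
block_lengths = nan_counts[nan_counts > 0].to_numpy()
print(block_lengths)
```

This does not address the performance gap mentioned in the comment; it only changes how the groups are formed.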
