How can I efficiently "stretch" present values in an array over absent ones?

Here "absent" can mean either nan or np.masked, whichever is easier to implement.

For example:

>>> from numpy import nan 
>>> do_it([1, nan, nan, 2, nan, 3, nan, nan, 4, 3, nan, 2, nan]) 
array([1, 1, 1, 2, 2, 3, 3, 3, 4, 3, 3, 2, 2]) 
# each nan is replaced with the first non-nan value before it 
>>> do_it([nan, nan, 2, nan]) 
array([nan, nan, 2, 2]) 
# don't care too much about the outcome here, but this seems sensible

I can see how you'd do this with a for loop:

import numpy as np
from numpy import nan

def do_it(a):
    res = []
    last_val = nan
    for item in a:
        if not np.isnan(item):
            last_val = item  # remember the most recent present value
        res.append(last_val)
    return np.asarray(res)

Is there a faster, vectorized way to do this?

2016-12-14 Eric

Working from @Benjamin's deleted solution, everything works out nicely if you work with indices:

def do_it(data, valid=None, axis=0): 
    # normalize the inputs to match the question examples 
    data = np.asarray(data) 
    if valid is None: 
        valid = ~np.isnan(data)

    # flat array of the data values 
    data_flat = data.ravel() 

    # array of indices such that data_flat[indices] == data 
    indices = np.arange(data.size).reshape(data.shape) 

    # thanks to benjamin here 
    stretched_indices = np.maximum.accumulate(valid*indices, axis=axis) 
    return data_flat[stretched_indices]
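
To make the index trick easier to follow, here is a minimal 1-D walkthrough of the intermediate arrays, using the second example from the question. The variable names `masked_indices` and `stretched` are only illustrative and do not appear in the function above.

```python
import numpy as np

data = np.array([np.nan, np.nan, 2.0, np.nan])
valid = ~np.isnan(data)                  # [False, False, True, False]
indices = np.arange(data.size)           # [0, 1, 2, 3]

# absent positions contribute index 0, present ones their own index
masked_indices = valid * indices         # [0, 0, 2, 0]

# indices only ever increase, so the running maximum is always the
# index of the most recent present value
stretched = np.maximum.accumulate(masked_indices)   # [0, 0, 2, 2]

result = data[stretched]                 # [nan, nan, 2., 2.]
```

Leading absent values fall back to index 0, which in the 1-D case simply returns `data[0]` (itself nan here).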

Comparing the runtimes of the working solutions:

>>> import numpy as np 
>>> data = np.random.rand(10000) 

>>> %timeit do_it_question(data) 
10000 loops, best of 3: 17.3 ms per loop 
>>> %timeit do_it_mine(data) 
10000 loops, best of 3: 179 µs per loop 
>>> %timeit do_it_user(data) 
10000 loops, best of 3: 182 µs per loop 

# with lots of nans 
>>> data[data > 0.25] = np.nan 

>>> %timeit do_it_question(data) 
10000 loops, best of 3: 18.9 ms per loop 
>>> %timeit do_it_mine(data) 
10000 loops, best of 3: 177 µs per loop 
>>> %timeit do_it_user(data) 
10000 loops, best of 3: 231 µs per loop

So both this and @user2357112's solution blow the loop from the question out of the water, and this one has a slight edge over @user2357112's when there are lots of nans.
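
The `%timeit` lines need IPython. A rough way to repeat the comparison from a plain script is sketched below. It assumes that `do_it_question`, `do_it_mine`, and `do_it_user` refer to the loop from the question, the index version in this answer, and @user2357112's cumsum version respectively; that mapping is inferred from the names, and the index version is simplified to 1-D. Absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

def do_it_question(a):
    # plain Python loop from the question
    res = []
    last_val = np.nan
    for item in a:
        if not np.isnan(item):
            last_val = item
        res.append(last_val)
    return np.asarray(res)

def do_it_mine(data):
    # index-stretching version, simplified to 1-D input
    data = np.asarray(data)
    valid = ~np.isnan(data)
    indices = np.arange(data.size)
    return data[np.maximum.accumulate(valid * indices)]

def do_it_user(x):
    # cumsum-over-flags version
    x = np.asarray(x)
    is_valid = ~np.isnan(x)
    is_valid[0] = True
    return x[is_valid][is_valid.cumsum() - 1]

rng = np.random.default_rng(0)
data = rng.random(10000)
data[data > 0.25] = np.nan   # lots of nans, as in the second set of timings

for f in (do_it_question, do_it_mine, do_it_user):
    # best of 3 repeats, 10 calls each, reported per call
    best = min(timeit.repeat(lambda: f(data), number=10, repeat=3)) / 10
    print(f"{f.__name__}: {best * 1e6:.0f} us per call")
```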

2016-12-14 19:15:44 Eric

Taking a cumsum over an array of flags gives a nice way to determine which numbers to write over the NaNs:

def do_it(x): 
    x = np.asarray(x) 

    is_valid = ~np.isnan(x) 
    is_valid[0] = True 

    valid_elems = x[is_valid] 
    replacement_indices = is_valid.cumsum() - 1 
    return valid_elems[replacement_indices]
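
As an illustration, here is the same computation unrolled on a small input, with each intermediate array shown in a comment. The input values are made up for the example.

```python
import numpy as np

x = np.array([np.nan, 5.0, np.nan, np.nan, 7.0, np.nan])

is_valid = ~np.isnan(x)      # [F, T, F, F, T, F]
is_valid[0] = True           # treat the first slot as present so every
                             # position has something to copy from
                             # -> [T, T, F, F, T, F]

valid_elems = x[is_valid]    # [nan, 5., 7.]

# every True bumps the running count, so (count - 1) is the position in
# valid_elems of the most recent present value
replacement_indices = is_valid.cumsum() - 1    # [0, 1, 1, 1, 2, 2]

result = valid_elems[replacement_indices]      # [nan, 5., 5., 5., 7., 7.]
```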

2016-12-14 18:11:15 user2357112

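Hmm, this doesn't work if x is 2d, but I guess that's not what I asked for – Eric

@Eric: Yeah, I had no idea you even wanted 2D input. – user2357112

I'd want it to handle each row independently, as though it were 1d – Eric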
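
For the row-by-row behaviour asked for in the comments, one possible sketch is to take a running maximum over column indices along axis 1 rather than a cumsum. The name `stretch_rows` is hypothetical, and this is a different technique from the cumsum answer above.

```python
import numpy as np

def stretch_rows(x):
    """Forward-fill NaNs along each row of a 2-D array independently."""
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)

    # column index where the value is present, 0 where it is absent
    cols = np.where(valid, np.arange(x.shape[1]), 0)

    # running maximum along each row: column of the latest present value
    cols = np.maximum.accumulate(cols, axis=1)

    # pair every row index with its stretched column indices
    rows = np.arange(x.shape[0])[:, None]
    return x[rows, cols]

example = np.array([[1.0, np.nan, 2.0, np.nan],
                    [np.nan, 3.0, np.nan, np.nan]])
filled = stretch_rows(example)
# filled is [[1., 1., 2., 2.],
#            [nan, 3., 3., 3.]]
```

A row that starts with NaN keeps its leading NaNs, because the fallback column 0 is looked up within that same row.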

Assuming there are no zeros in your data (so that numpy.nan_to_num can be used):

b = numpy.maximum.accumulate(numpy.nan_to_num(a))
# b is now array([ 1., 1., 1., 2., 2., 3., 3., 3., 4., 4.])
mask = numpy.isnan(a)
a[mask] = b[mask]
# a is now array([ 1., 1., 1., 2., 2., 3., 3., 3., 4., 3.])

Edit: As Eric pointed out, an even better solution is to replace the NaNs with -inf:

mask = numpy.isnan(a) 
a[mask] = -numpy.inf 
b = numpy.maximum.accumulate(a) 
a[mask] = b[mask]

2016-12-14 18:54:56 Benjamin

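Nice! Replacing 'nan' with '-inf' would work here too, right? – Eric

@Eric: Indeed, an even better solution. – Benjamin

Wait, this doesn't work. See my updated test case. You're assuming the values are always increasing, and they're not – Eric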
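
The objection in the last comment can be checked with a short sketch on a non-monotonic input taken from the tail of the question's first example. The names `filled` and `running_max` are illustrative only.

```python
import numpy as np

a = np.array([1.0, np.nan, 4.0, 3.0, np.nan, 2.0, np.nan])
mask = np.isnan(a)

filled = a.copy()
filled[mask] = -np.inf
running_max = np.maximum.accumulate(filled)   # [1., 1., 4., 4., 4., 4., 4.]
filled[mask] = running_max[mask]

# filled is now [1., 1., 4., 3., 4., 2., 4.]
# but we wanted [1., 1., 4., 3., 3., 2., 2.]
```

The running maximum of the values remembers the largest value seen so far, not the most recent one, so the two only agree when the present values never decrease.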
