Interval對於一個NumPy數組，其間隔由另一個數組定義的argmax

我有一個排序的非唯一數字的一維數組。他們重複的次數是隨機的。它與具有相同大小的權重數組相關聯。對於給定的一系列相同的元素，相關的一系列權重可能也可能不具有重複的元素，並且在整個權重陣列中，可能有也可能不會有重複的元素。 E.g：Interval對於一個NumPy數組，其間隔由另一個數組定義的argmax

pos  = np.array([3, 3, 7, 7, 9, 9, 9, 10, 10]) 
weights = np.array([2, 10, 20, 8, 5, 7, 15, 7, 2])

我需要提取的pos獨特元件的陣列，但是其中唯一元件是一個具有最大權重。

我想出了一個工作的解決方案涉及循環：

pos  = np.array([3, 3, 7, 7, 9, 9, 9, 10, 10]) 
weights = np.array([2, 10, 20, 8, 5, 7, 15, 7, 2]) 
# Get the number of occurences of the elements in pos but throw away the unique array, it's not the one I want. 
_, ucounts = np.unique(pos, return_counts=True) 
# Initialize the output array. 
unique_pos_idx = np.zeros([ucounts.size], dtype=np.uint32) 

last = 0 
for i in range(ucounts.size): 
    maxpos = np.argmax(weights[last:last+ucounts[i]]) 
    unique_pos_idx[i] = last + maxpos 
    last += ucounts[i] 

# Result is: 
# unique_pos_idx = [1 2 6 7]

，但我不使用太多的Python語言或numpy的（除了使用numpy的陣列），所以我不知道是否有一個更多Pythonesque和/或更高效的解決方案，甚至比以上的Cython版本更多？

感謝

來源

2017-10-11 Wall-E

這裏有一個量化的方法 - 性能

sidx = np.lexsort([weights,pos]) 
out = sidx[np.r_[np.flatnonzero(pos[1:] != pos[:-1]), -1]]

可能改善（S） -

1]更快的方式得到分類指數sidx與scaling -

sidx = (pos*(weights.max()+1) + weights).argsort()

2]在e處的索引第二，可以由具有boolean-indexing更快，特別是用許多這樣的間隔/團體打交道時 -

out = sidx[np.concatenate((pos[1:] != pos[:-1], [True]))]

運行測試

所有的方法：

def org_app(pos, weights): 
    _, ucounts = np.unique(pos, return_counts=True) 
    unique_pos_idx = np.zeros([ucounts.size], dtype=np.uint32)  
    last = 0 
    for i in range(ucounts.size): 
     maxpos = np.argmax(weights[last:last+ucounts[i]]) 
     unique_pos_idx[i] = last + maxpos 
     last += ucounts[i] 
    return unique_pos_idx 

def vec_app(pos, weights): 
    sidx = np.lexsort([weights,pos]) 
    return sidx[np.r_[np.flatnonzero(pos[1:] != pos[:-1]), -1]] 

def vec_app_v2(pos, weights): 
    sidx = (pos*(weights.max()+1) + weights).argsort() 
    return sidx[np.concatenate((pos[1:] != pos[:-1], [True]))]

時序和驗證 -

對於設置，讓我們使用該示例並將其瓦片10000次小時縮放，因爲我們打算創建1000倍以上的時間間隔。另外，讓我們用識別號weights，使argmax指數不相同的數字迷惑：

In [155]: # Setup input 
    ...: pos = np.array([3, 3, 7, 7, 9, 9, 9, 10, 10,]) 
    ...: pos = (pos + 10*np.arange(10000)[:,None]).ravel() 
    ...: weights = np.random.choice(10*len(pos), size=len(pos), replace=0) 
    ...: 
    ...: print np.allclose(org_app(pos, weights), vec_app(pos, weights)) 
    ...: print np.allclose(org_app(pos, weights), vec_app_v2(pos, weights)) 
    ...: 
True 
True 

In [156]: %timeit org_app(pos, weights) 
    ...: %timeit vec_app(pos, weights) 
    ...: %timeit vec_app_v2(pos, weights) 
    ...: 
10 loops, best of 3: 56.4 ms per loop 
100 loops, best of 3: 14.8 ms per loop 
1000 loops, best of 3: 1.77 ms per loop 

In [157]: 56.4/1.77 # Speedup with vectorized one over loopy 
Out[157]: 31.864406779661017

來源

2017-10-11 21:23:13 Divakar

非常感謝您@Divakar。我在你的回覆中學到了很多技巧。在上面的代碼中，我應該考慮'np.flatnonzero（pos [1：]！= pos [： - 1]）是否是一種在數組中定位唯一元素的替代方法？從我的角度來看，就好像我們拒絕了零導數的位置並保留其餘部分。我是否正確理解你將-1附加到它，考慮到這種分類方式使得最後一個元素必然是最重的元素？ –

@ user31412'np.flatnonzero（pos [1：]！= pos [： - 1]）基本上是通過切片一個切片得到間隔*改變的索引。我們需要這些，因爲我們打算得到每個組的最後一個，因爲lexsort/argsort會代表每個組/間隔的最大參數。最後-1是需要的，因爲切片會錯過獲取最後一組的最後一個元素。所以，我們手動添加。這給了我們最後一組的argmax，因爲sidx會把那個作爲最後一個元素。希望這是有道理的。 – Divakar

當然，這對我有意義。我要玩你提供的不同方法。非常感謝。只要我確信我能理解你寫的所有內容，我就會馬上接受這個答案。 –

Interval對於一個NumPy數組，其間隔由另一個數組定義的argmax

回答

相關問題