更快的numpy解決方案，而不是itertools.combinations？

我使用itertools.combinations()如下：更快的numpy解決方案，而不是itertools.combinations？

import itertools 
import numpy as np 

L = [1,2,3,4,5] 
N = 3 

output = np.array([a for a in itertools.combinations(L,N)]).T

其中產量我輸出我需要：

array([[1, 1, 1, 1, 1, 1, 2, 2, 2, 3], 
     [2, 2, 2, 3, 3, 4, 3, 3, 4, 4], 
     [3, 4, 5, 4, 5, 5, 4, 5, 5, 5]])

我反覆和過度使用在多環境中，該表達，我需要它儘可能快。

從this post我明白itertools爲基礎的代碼是不是最快的解決方案，並使用numpy可能是一種進步，但我不是在numpy優化調度技巧不夠好，理解和適應形式，它有或迭代代碼拿出我自己的優化。

任何幫助將不勝感激。

編輯：

L來源於熊貓數據幀，所以它可以也被看作是一個numpy的數組：

L = df.L.values

來源

2017-02-09 Khris

對於2D你必須'numpy.triu_in骰子「，但更高的尺寸更困難 –

['sklearn.utils.extmath.cartesian']（https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath。 PY＃L557）？ – blacksite

到目前爲止，從來沒有看過「scikit-learn」，我會閱讀它。 – Khris

這裏有一個這是略快於itertools UPDATE：和一個（nump2）這實際上是相當快一點：

import numpy as np 
import itertools 
import timeit 

def nump(n, k, i=0): 
    if k == 1: 
     a = np.arange(i, i+n) 
     return tuple([a[None, j:] for j in range(n)]) 
    template = nump(n-1, k-1, i+1) 
    full = np.r_[np.repeat(np.arange(i, i+n-k+1), 
          [t.shape[1] for t in template])[None, :], 
       np.c_[template]] 
    return tuple([full[:, j:] for j in np.r_[0, np.add.accumulate(
     [t.shape[1] for t in template[:-1]])]]) 

def nump2(n, k): 
    a = np.ones((k, n-k+1), dtype=int) 
    a[0] = np.arange(n-k+1) 
    for j in range(1, k): 
     reps = (n-k+j) - a[j-1] 
     a = np.repeat(a, reps, axis=1) 
     ind = np.add.accumulate(reps) 
     a[j, ind[:-1]] = 1-reps[1:] 
     a[j, 0] = j 
     a[j] = np.add.accumulate(a[j]) 
    return a 

def itto(L, N): 
    return np.array([a for a in itertools.combinations(L,N)]).T 

k = 6 
n = 12 
N = np.arange(n) 

assert np.all(nump2(n,k) == itto(N,k)) 

print('numpy ', timeit.timeit('f(a,b)', number=100, globals={'f':nump, 'a':n, 'b':k})) 
print('numpy 2 ', timeit.timeit('f(a,b)', number=100, globals={'f':nump2, 'a':n, 'b':k})) 
print('itertools', timeit.timeit('f(a,b)', number=100, globals={'f':itto, 'a':N, 'b':k}))

時序：

k = 3, n = 50 
numpy  0.06967267207801342 
numpy 2 0.035096961073577404 
itertools 0.7981023890897632 

k = 3, n = 10 
numpy  0.015058324905112386 
numpy 2 0.0017436158377677202 
itertools 0.004743851954117417 

k = 6, n = 12 
numpy  0.03546895203180611 
numpy 2 0.00997065706178546 
itertools 0.05292179994285107

來源

2017-02-13 10:45:27

謝謝你的努力，但'n'和'k'是什麼？是否有可能以可以使用具有任意元素的列表的方式編寫該解決方案，例如像這樣的字符串'['a'，'ret'，'g'，'fd']'？ – Khris

也像你這樣使用的函數會拋出一個錯誤：'ValueError：至少需要一個數組來連接'，它似乎來自級別10或11的迭代。 – Khris

@Kris n和k是集合的大小，子集。你可以通過創建一個對象數組'oa = np.array（['a'，'ret'，'g'，'fd']，dtype = object）'來取代任意對象的數字，然後使用輸出'出'的'nump'就像so'oa [out [0]]'。重新例外是按原樣運行腳本還是使用其他參數？如果是後者，你可以發佈它們嗎？ –

這肯定是不快於itertools.combinations但是矢量化numpy：

def nd_triu_indices(T,N): 
    o=np.array(np.meshgrid(*(np.arange(len(T)),)*N)) 
    return np.array(T)[o[...,np.all(o[1:]>o[:-1],axis=0)]] 

%timeit np.array(list(itertools.combinations(T,N))).T 
The slowest run took 4.40 times longer than the fastest. This could mean that an intermediate result is being cached. 
100000 loops, best of 3: 8.6 µs per loop 

%timeit nd_triu_indices(T,N) 
The slowest run took 4.64 times longer than the fastest. This could mean that an intermediate result is being cached. 
10000 loops, best of 3: 52.4 µs per loop

不確定如果這是以另一種方式進行矢量化的話，或者如果其中一個優化嚮導可以使此方法更快。

編輯：想出了另一種方式，但仍不能快於combinations：

%timeit np.array(T)[np.array(np.where(np.fromfunction(lambda *i: np.all(np.array(i)[1:]>np.array(i)[:-1], axis=0),(len(T),)*N,dtype=int)))] 
The slowest run took 7.78 times longer than the fastest. This could mean that an intermediate result is being cached. 
10000 loops, best of 3: 34.3 µs per loop

來源

2017-02-09 16:28:13

爲什麼'eval'？這完全沒有必要。你可以建立和解壓包含'N'參數的元組。 – user2357112

我無法工作。出於某種原因'np。meshgrid（（range（5），）* 3）'給出'np.arange（[0,1,2,3,4,0,1,2,3,4,0,1,2,3,4] ）' –

你沒有解開元組。 – user2357112

更快的numpy解決方案，而不是itertools.combinations？

回答

相關問題