掩蓋的Numpy陣列比普通的numpy陣列慢很多

我有一個函數可以計算numpy數組中所有行對之間的成對相關性。它工作正常，但後來我記得，我經常需要處理缺失的數據。我用掩碼數組來嘗試解決這個問題，但它會使計算速度減慢很多。任何有關使用蒙版功能的想法。我認爲真正的瓶頸將在np.ma.dot函數中。但我添加了一些配置文件，我很快就用iPython進行了嘲弄。根據這些陣列所具有的行數，我應該說5000就在頻譜的低端。有些可能超過300,000。帶有掩碼數組的代碼看起來比沒有代碼慢20倍，但顯然缺少數據，沒有掩碼數組的代碼每隔一段時間就會產生一個NaN。掩蓋的Numpy陣列比普通的numpy陣列慢很多

首先一個快速和骯髒的方式來產生一些示例數據

genotypes = np.empty((5000,200)) 
for i in xrange(5000): 
    genotypes[i] = np.random.binomial(3,.333, size=200) 

pValues = np.random.uniform(0,1,5000)

然後功能測試

def testMask(pValsArray, genotypeArray): 
    nSNPs = len(pValsArray)-1 
    genotypeArray = np.ma.masked_array(genotypeArray, np.isnan(genotypeArray)) 
    chisq = np.sum(-2 * np.log(pValsArray)) 
    ms = genotypeArray.mean(axis=1)[(slice(None,None,None),None)] 
    datam = genotypeArray - ms 
    datass = np.ma.sqrt(np.ma.sum(datam**2, axis=1)) 
    runningSum = 0 
    for i in xrange(nSNPs): 
     temp = np.ma.dot(datam[i:],datam[i].T) 
     d = (datass[i:]*datass[i]) 
     rs = temp/d 
     rs = np.absolute(rs)[1:] 
     runningSum += 3.25 * np.sum(rs) + .75 * np.dot(rs, rs) 
    sigmaSq = 4*nSNPs+2*runningSum 
    E = 2*nSNPs 
    df = (2*(E*E))/sigmaSq 
    runningSum = sigmaSq/(2*E) 
    d = chisq/runningSum 
    brownsP = stats.chi2.sf(d, df) 
    return brownsP 

def testNotMask(pValsArray, genotypeArray): 
    nSNPs = len(pValsArray)-1 
    chisq = np.sum(-2 * np.log(pValsArray)) 
    ms = genotypeArray.mean(axis=1)[(slice(None,None,None),None)] 
    datam = genotypeArray - ms 
    datass = np.sqrt(stats.ss(datam, axis=1)) 
    runningSum = 0 
    for i in xrange(nSNPs): 
     temp = np.dot(datam[i:],datam[i].T) 
     d = (datass[i:]*datass[i]) 
     rs = temp/d 
     rs = np.absolute(rs)[1:] 
     runningSum += 3.25 * np.sum(rs) + .75 * np.dot(rs, rs) 
    sigmaSq = 4*nSNPs+2*runningSum 
    E = 2*nSNPs 
    df = (2*(E*E))/sigmaSq 
    runningSum = sigmaSq/(2*E) 
    d = chisq/runningSum 
    brownsP = stats.chi2.sf(d, df) 
    return brownsP

還有一些時間

%timeit testMask(pValues, genotypes) 
1 loops, best of 3: 14.3 s per loop 

%timeit testNotMask(pValues, genotypes) 
1 loops, best of 3: 678 ms per loop

添加一些丟失的數據，去再次：

randis = np.random.randint(0,5000, 10) 
randjs = np.random.randint(0,200, 10) 

for i,j in zip(randis, randjs): 
    genotypes[i,j] = None 



%timeit testMask(pValues, genotypes) 
1 loops, best of 3: 14.2 s per loop 

%timeit testNotMask(pValues, genotypes) 
1 loops, best of 3: 654 ms per loop

和一些剖析：

%prun 

     2559677 function calls in 15.045 seconds 

    Ordered by: internal time 

    ncalls tottime percall cumtime percall filename:lineno(function) 
    9791 5.259 0.001 5.259 0.001 {method 'copy' of 'numpy.ndarray' objects} 
    4999 2.877 0.001 11.888 0.002 extras.py:949(dot) 
    9794 1.566 0.000 1.566 0.000 {numpy.core.multiarray.copyto} 
    14997 1.497 0.000 1.564 0.000 {numpy.core._dotblas.dot} 
    30007 0.559 0.000 0.559 0.000 {method 'reduce' of 'numpy.ufunc' objects} 
    94996 0.450 0.000 0.875 0.000 core.py:2751(_update_from) 
    864970 0.347 0.000 0.347 0.000 {getattr} 
    5000 0.346 0.000 0.802 0.000 core.py:1065(__call__) 
     1 0.240 0.240 15.045 15.045 <ipython-input-115-2aab1c8ea4c5>:1(testMask) 
    5000 0.196 0.000 0.196 0.000 core.py:771(__call__) 
    5001 0.147 0.000 0.551 0.000 core.py:917(__call__) 
    24996 0.143 0.000 0.609 0.000 core.py:2930(__getitem__) 
    54998 0.140 0.000 0.775 0.000 core.py:2776(__array_finalize__) 
    419985 0.126 0.000 0.126 0.000 {method 'update' of 'dict' objects} 
     1 0.093 0.093 0.111 0.111 core.py:5990(power) 
    339994 0.077 0.000 0.077 0.000 {isinstance} 
    50015 0.072 0.000 0.072 0.000 {numpy.core.multiarray.array} 
    60002 0.060 0.000 0.568 0.000 {method 'view' of 'numpy.ndarray' objects} 
    5000 0.060 0.000 0.199 0.000 core.py:2626(__new__) 
    14999 0.058 0.000 7.412 0.000 core.py:3341(filled) 
    25005 0.055 0.000 0.092 0.000 core.py:609(getdata)

編輯：

我試圖perimosocordiae的答案，但我仍然得到nan秒。它看起來像mean,stats.ss和np.sqrt函數都關心nan值。

def fastNotMask(pValsArray, genotypeArray): 
    nSNPs = len(pValsArray)-1 
    chisq = np.sum(-2 * np.log(pValsArray)) 
    ms = genotypeArray.mean(axis=1)[(slice(None,None,None),None)] 
    print ms 
    datam = genotypeArray - ms 
    print datam 
    datass = np.sqrt(stats.ss(datam, axis=1)) 
    print datass 
    runningSum = 0 
    for i in xrange(nSNPs): 
     temp = np.dot(datam[i:],datam[i].T) 
     d = (datass[i:]*datass[i]) 
     rs = temp/d 
     rs = np.absolute(rs)[1:] 
     runningSum += 3.25 * np.nansum(rs) + .75 * np.nansum(rs * rs) 
    print runningSum 
    sigmaSq = 4*nSNPs+2*runningSum 
    E = 2*nSNPs 
    df = (2*(E*E))/sigmaSq 
    runningSum = sigmaSq/(2*E) 
    d = chisq/runningSum 
    brownsP = stats.chi2.sf(d, df) 
    return brownsP

用少量輸出測試這個結果表明nan沒有得到正確的處理。

pValues = np.random.uniform(0,1,10) 

genotypes = np.empty((10,10)) 
for i in xrange(10): 
    genotypes[i] = np.random.binomial(2,.5, size=10) 

randis = np.random.randint(0,10, 2) 
randjs = np.random.randint(0,10, 2) 

for i,j in zip(randis, randjs): 
    genotypes[i,j] = None 

print testfastMask(pValues, genotypes) 



[[ 1.5] 
[ 1.2] 
[ 0.9] 
[ 1.2] 
[ 1.1] 
[ 0.6] 
[ nan] 
[ 1.1] 
[ nan] 
[ 0.8]] 
[[-0.5 -0.5 0.5 0.5 -0.5 -0.5 0.5 0.5 -0.5 0.5] 
[-0.2 0.8 -0.2 -0.2 -0.2 -0.2 -0.2 0.8 -0.2 -0.2] 
[-0.9 0.1 -0.9 1.1 0.1 -0.9 1.1 0.1 0.1 0.1] 
[-0.2 -0.2 -0.2 -0.2 0.8 -0.2 -0.2 -1.2 0.8 0.8] 
[-0.1 -0.1 -0.1 -0.1 -0.1 -0.1 0.9 -0.1 0.9 -1.1] 
[-0.6 1.4 0.4 -0.6 -0.6 0.4 -0.6 -0.6 0.4 0.4] 
[ nan nan nan nan nan nan nan nan nan nan] 
[-0.1 0.9 -1.1 -1.1 0.9 -0.1 0.9 -1.1 0.9 -0.1] 
[ nan nan nan nan nan nan nan nan nan nan] 
[ 1.2 -0.8 -0.8 0.2 -0.8 0.2 1.2 0.2 0.2 -0.8]] 
[ 1.58113883 1.26491106 2.21359436 1.8973666 1.70293864 2.0976177 
     nan 2.62678511   nan 2.36643191] 
nan 
nan

我在這裏錯過了什麼嗎？這可能是一個版本問題。我使用python 2.7和numpy 1.7.1？

乾杯的幫助。

來源

2014-02-13 Davy Kavanagh

編輯：原來的答案不爲numpy的版本< = 1.8工作，其中np.nansum([NaN, NaN]) == 0.0（note the FutureWarning）。對於早期版本，您必須手動檢查情況：

tmp = 3.25 * np.nansum(rs) + .75 * np.nansum(rs * rs) 
if not np.isnan(tmp): 
    runningSum += tmp

或者，你可以建立tmp值的列表/陣列，並呼籲對np.nansum。

原來的答案：

所有你需要改變的是這一行testNotMask：

runningSum += 3.25 * np.sum(rs) + .75 * np.dot(rs, rs)

這樣：

runningSum += 3.25 * np.nansum(rs) + .75 * np.nansum(rs * rs)

其他所有操作都正常工作與NaN值，所以你可以得到所有的演講d的非屏蔽數組，同時仍然得到正確的結果。

這是證明。功能fastNotMask只是testNotMask與應用上述更改。

In [63]: genotypes = np.random.binomial(3, .333, size=(5000, 200)).astype(float) 

In [64]: pValues = np.random.uniform(0,1,5000) 

In [65]: %timeit testMask(pValues, genotypes) 
1 loops, best of 3: 11.3 s per loop 

In [66]: %timeit testNotMask(pValues, genotypes) 
1 loops, best of 3: 3.53 s per loop 

In [67]: %timeit fastNotMask(pValues, genotypes) 
1 loops, best of 3: 3.96 s per loop 

In [68]: randjs = np.random.randint(0,200, 10) 

In [69]: randis = np.random.randint(0,5000,10) 

In [70]: genotypes[randis,randjs] = None 

In [71]: %timeit testMask(pValues, genotypes) 
1 loops, best of 3: 33 s per loop 

In [72]: %timeit testNotMask(pValues, genotypes) 
1 loops, best of 3: 3.6 s per loop 

In [73]: %timeit fastNotMask(pValues, genotypes) 
1 loops, best of 3: 3.98 s per loop 

In [74]: testMask(pValues, genotypes) 
Out[74]: 0.47606794747438386 

In [75]: testNotMask(pValues, genotypes) 
Out[75]: nan 

In [76]: fastNotMask(pValues, genotypes) 
Out[76]: 0.47613597091679449

注意，有精密testMask和fastNotMask之間略有差別。我其實不確定這是從哪裏來的，但我會認爲這並不重要。

來源

2014-02-14 19:52:33 perimosocordiae

很酷，謝謝！經過數小時的谷歌搜索numpy.nansum（）從未出現。我的搜索功能很弱。乾杯。 –

實際上，在進行測試之後，我仍然從'fastNotMask'獲得'nan's的結果。往上看。對不起，如果我在這裏很愚蠢。 –

啊，這就是我使用流血的numpy。我相應地更新了我的答案。 – perimosocordiae

掩蓋的Numpy陣列比普通的numpy陣列慢很多

回答

相關問題