2017-02-20

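Cython: slow numpy arrays

I want to speed up my code using Cython. After translating the Python code into Cython, I see no speedup at all. I think the root of the problem is the poor performance I get when using numpy arrays in Cython.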

I have come up with a very simple program that shows this:

############### test.pyx ################# 
import numpy as np 
cimport numpy as np 
cimport cython 

def func1(long N):

    cdef double sum1,sum2,sum3
    cdef long i

    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0

    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i

    return sum1,sum2,sum3

def func2(long N):

    cdef np.ndarray[np.float64_t,ndim=1] sum_arr
    cdef long i

    sum_arr = np.zeros(3,dtype=np.float64)

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr

def func3(long N):

    cdef double sum_arr[3]
    cdef long i

    sum_arr[0] = 0.0
    sum_arr[1] = 0.0
    sum_arr[2] = 0.0

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr
########################################## 

################## test.py ############### 
import time 
import test as test 

N = 1000000000 

for i in xrange(10): 

    start = time.time() 
    sum1,sum2,sum3 = test.func1(N) 
    print 'Time taken = %.3f'%(time.time()-start) 

print '\n' 
for i in xrange(10): 
    start = time.time() 
    sum_arr = test.func2(N) 
    print 'Time taken = %.3f'%(time.time()-start) 

print '\n' 
for i in xrange(10): 
    start = time.time() 
    sum_arr = test.func3(N) 
    print 'Time taken = %.3f'%(time.time()-start) 
############################################ 

And from python test.py I get:

Time taken = 1.445 
Time taken = 1.433 
Time taken = 1.434 
Time taken = 1.428 
Time taken = 1.449 
Time taken = 1.425 
Time taken = 1.421 
Time taken = 1.451 
Time taken = 1.483 
Time taken = 1.418 

Time taken = 2.623 
Time taken = 2.603 
Time taken = 2.977 
Time taken = 3.237 
Time taken = 2.748 
Time taken = 2.798 
Time taken = 2.811 
Time taken = 2.783 
Time taken = 2.585 
Time taken = 2.595 

Time taken = 1.503 
Time taken = 1.529 
Time taken = 1.509 
Time taken = 1.543 
Time taken = 1.427 
Time taken = 1.425 
Time taken = 1.423 
Time taken = 1.415 
Time taken = 1.414 
Time taken = 1.418 

My question is: why is func2 almost 2x slower than func1 and func3?

Is there a way to improve this?

Thanks!

######## UPDATE

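My real problem is the following. I am calling a function that takes a 3D array (say P[i,j,k]). The function loops over every element and computes several quantities: one that depends on the value of the array at that position (say A = f(P[i,j,k])), and another that depends only on the position (B = g(i,j,k)). Schematically: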

for i in xrange(N):
    corr1 = h(i,val)

    for j in xrange(N):
        corr2 = h(j,val)

        for k in xrange(N):
            corr3 = h(k,val)

            A = f(P[i,j,k])
            B = g(i,j,k)
            Arr[B] += A*corr1*corr2*corr3

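where val is a property of the 3D array, represented by a number. This number can be different for different fields.

Since I have to do this operation for many 3D arrays, I thought it would be better to write a new routine that accepts many different input 3D arrays, with the number of arrays not known in advance. The idea is that, because B is exactly the same for all arrays, I can avoid computing it for every array and compute it only once. The problem is that corr1, corr2 and corr3 above then become arrays:

If the number of 3D arrays is num_3D_arrays, I do something like: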

for i in xrange(N):
    for p in xrange(num_3D_arrays):
        corr1[p] = h(i,val[p])

    for j in xrange(N):
        for p in xrange(num_3D_arrays):
            corr2[p] = h(j,val[p])

        for k in xrange(N):
            for p in xrange(num_3D_arrays):
                corr3[p] = h(k,val[p])

            B = g(i,j,k)
            for p in xrange(num_3D_arrays):
                A[p] = f(P[i,j,k])
                Arr[p,B] += A[p]*corr1[p]*corr2[p]*corr3[p]

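So changing val and the variables corr1, corr2, corr3 and A from scalars to arrays wipes out the performance gain I expected from not running the big loop once per array.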

+0

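The code 'for i in range(N): sum_arr[0] += i sum_arr[1] += 2.0*i sum_arr[2] += 3.0*i' ignores everything numpy is good at. Numpy is not fast because it gives you fast access to individual indices, but because it can do numerical operations on whole arrays quickly. Not like this, though. I suggest reading up on numpy –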

+0

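I think it is hard to make this faster. If you insist on using 'numpy', you have to create a numpy array inside that loop and call np.sum() on it, but creating the numpy array is probably the slowest thing in that snippet. I would also suggest profiling each line separately rather than relying on this simple timing. **[Some reading on profiling](http://stackoverflow.com/questions/582336/how-can-you-profile-a-script)** –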
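As a lighter first step before a full profiler, the single time.time() difference used in test.py can be replaced by repeated timings. The sketch below is plain Python, not the compiled module; scalar_sums and best_time are hypothetical helper names introduced here for illustration, and taking the minimum of several runs is one common convention rather than the only option.

```python
import timeit


def scalar_sums(n):
    """Pure-Python version of func1, used here only as something to time."""
    s1 = s2 = s3 = 0.0
    for i in range(n):
        s1 += i
        s2 += 2.0 * i
        s3 += 3.0 * i
    return s1, s2, s3


def best_time(func, *args, repeat=5, number=1):
    """Return the fastest of `repeat` measurements, in seconds per call."""
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


# Example run with a small N so that it finishes quickly.
print("scalar_sums: %.6f s per call" % best_time(scalar_sums, 100000))
```

The same best_time helper could be pointed at test.func1, test.func2 and test.func3 once the extension module is built.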

+0

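OK, thanks! In my case the problem is that I cannot define individual variables as in func1; I need to define an array whose size I do not know a priori. Is there a different way to do this than using numpy arrays? – Francisco

Answers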

0
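  • Why is func2 almost 2x slower than func1?

    Because indexing adds an indirection, which roughly doubles the number of elementary operations. Compute the sums as in func1, then assign them with sum=array([sum1,sum2,sum3])

  • How to speed up Python code?

    1. NumPy is the first good idea: it gets close to C speed with little effort.

    2. Numba can do the same with almost no effort, and is very simple to use.

    3. Cython for critical cases.

Here are some examples to illustrate: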

import numpy as np
import numba

# python way
def func1(N):
    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0

    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i

    return sum1,sum2,sum3

# numpy way: build the index array once, then use vectorized sums
def func2(N):
    aran = np.arange(float(N))
    sum1 = aran.sum()
    sum2 = (2.0*aran).sum()
    sum3 = (3.0*aran).sum()
    return sum1,sum2,sum3

# numba way: compile func1 unchanged
func3 = numba.njit(func1)

""" 
In [609]: %timeit func1(10**6) 
1 loop, best of 3: 710 ms per loop 

In [610]: %timeit func2(1e6) 
100 loops, best of 3: 22.2 ms per loop 

In [611]: %timeit func3(10e6) 
100 loops, best of 3: 2.87 ms per loop 
""" 
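The NumPy variant above also extends to the case raised in the comments, where the number of accumulators is only known at run time. The following is a minimal sketch under that assumption; scaled_sums and its coeffs argument are hypothetical names, not part of the original answer. It relies on each sum in this toy problem being a constant multiple of the same base sum, which will not hold for every real workload.

```python
import numpy as np


def scaled_sums(n, coeffs):
    """Return sum_i coeffs[m] * i for every coefficient, as a 1D array.

    The base sum over i is computed once and then scaled, so the length
    of `coeffs` does not need to be known when the function is written.
    """
    base = np.arange(n, dtype=np.float64).sum()
    return np.asarray(coeffs, dtype=np.float64) * base


print(scaled_sums(5, [1.0, 2.0, 3.0]))
```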
+0

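Thanks for the reply! My real problem is more complicated than func1 and func2; I only used them to illustrate the issue. What I do not fully understand is why indexing is so slow with numpy. If I define the routine: – Francisco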

+0

def func3(long N): cdef double sum_arr[3] cdef long i sum_arr[0] = 0.0; sum_arr[1] = 0.0; sum_arr[2] = 0.0 for i in range(N): sum_arr[0] += i sum_arr[1] += 2.0*i sum_arr[2] += 3.0*i return sum_arr – Francisco

+0

sum1 += 23 costs 1 ns, whereas sum_arr[i] += 23 is really sum_arr[ref+i] += 23 and costs 2 ns. It is an addressing issue, at the assembly level. –

0

Look at the HTML produced by cython -a ...pyx

For func1, the sum1 += i line expands to:

+15:   sum1 += i 
    __pyx_v_sum1 = (__pyx_v_sum1 + __pyx_v_i); 

For func3, with a C array:

+45:   sum_arr[0] += i 
    __pyx_t_3 = 0; 
    (__pyx_v_sum_arr[__pyx_t_3]) = ((__pyx_v_sum_arr[__pyx_t_3]) + __pyx_v_i); 

Slightly more complicated, but straightforward C.

But for func2:

+29:   sum_arr[0] += i 
    __pyx_t_12 = 0; 
    __pyx_t_6 = -1; 
    if (__pyx_t_12 < 0) { 
     __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape; 
     if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0; 
    } else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0; 
    if (unlikely(__pyx_t_6 != -1)) { 
     __Pyx_RaiseBufferIndexError(__pyx_t_6); 
     __PYX_ERR(0, 29, __pyx_L1_error) 
    } 
    *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i; 

Much more complicated, with references to numpy buffer helpers (e.g. __Pyx_BufPtrStrided1d). Even initializing the array is complicated:

+26:  sum_arr = np.zeros(3,dtype=np.float64) 
    __pyx_t_1 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error) 
    __Pyx_GOTREF(__pyx_t_1); 
    .... 

I expect that moving the creation of sum_arr into the calling Python code and passing it as an argument to func2 would save some time.
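The calling pattern suggested here can be sketched in plain Python as follows. func2_into is a hypothetical name; in the actual .pyx file the sum_arr parameter would also need a typed declaration, which is not shown, and whether this saves measurable time is not established in this thread.

```python
import numpy as np


def func2_into(n, sum_arr):
    """Accumulate the three sums into a caller-supplied array, in place."""
    for i in range(n):
        sum_arr[0] += i
        sum_arr[1] += 2.0 * i
        sum_arr[2] += 3.0 * i


# The caller allocates the buffer once and reuses it across calls.
buf = np.zeros(3, dtype=np.float64)
for _ in range(3):
    buf[:] = 0.0          # reset instead of reallocating
    func2_into(10, buf)

print(buf)
```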

Have you read this guide on using memoryviews?

http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html

You will get the best Cython performance if you focus on writing low-level operations, so that they translate into simple C. In

for k in xrange(N):
    corr3 = h(k,val)

    A = f(P[i,j,k])
    B = g(i,j,k)
    Arr[B] += A*corr1*corr2*corr3

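it is not the i,j,k loops that will slow you down. It is evaluating h, f and g every time, as well as Arr[B] += .... Those functions should be tightly coded Cython, not general Python functions. Look at how simply the sum3d function in the memoryview guide compiles.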
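A different route from the tightly coded Cython recommended here is to vectorize the whole accumulation in NumPy, as one of the comments hints. The sketch below is only an illustration under several assumptions: h, f and g are hypothetical stand-ins (the real ones are not given in the question), the input arrays are stacked as P_all with shape (num_arrays, N, N, N), and g returns non-negative integer bin indices smaller than n_bins. It shows B being built once and shared by all arrays, with np.bincount doing the Arr[p,B] += ... step.

```python
import numpy as np


# Hypothetical stand-ins for the question's h, f and g.
def h(idx, val):
    return 1.0 + val * idx


def f(x):
    return x * x


def g(i, j, k):
    # Bin index that depends only on the position.
    return i + j + k


def binned_sums(P_all, vals, n_bins):
    """Return Arr with shape (num_arrays, n_bins).

    Requires every value returned by g to be in [0, n_bins).
    """
    num_arrays, N = P_all.shape[0], P_all.shape[1]
    idx = np.arange(N)
    # B depends only on (i, j, k): compute it once for all arrays.
    B = g(idx[:, None, None], idx[None, :, None], idx[None, None, :]).ravel()
    Arr = np.zeros((num_arrays, n_bins))
    for p in range(num_arrays):
        c = h(idx, vals[p])  # 1D correction, reused along each axis
        weights = (f(P_all[p])
                   * c[:, None, None] * c[None, :, None] * c[None, None, :])
        # Sum the weights of all cells that fall into the same bin.
        Arr[p] = np.bincount(B, weights=weights.ravel(), minlength=n_bins)
    return Arr
```

With this stand-in g, the largest bin index is 3*(N-1), so n_bins must be at least 3*N - 2. The intermediate weights array has the same size as one input array, which trades memory for speed.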

+0

Thanks! Now I understand what is going on. The solution proposed by @user7138814 works well – Francisco

2

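There are several things you can do to speed up array indexing in Cython: turn off bounds checking, turn off negative-index wraparound, and use a typed memoryview declared as contiguous.

So for your function: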

@cython.boundscheck(False) 
@cython.wraparound(False) 
def func2(long N): 

    cdef np.float64_t[::1] sum_arr 
    cdef long i 

    sum_arr = np.zeros(3,dtype=np.float64) 

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr 

For the original code, Cython generates the following C code for the line sum_arr[0] += i:

__pyx_t_12 = 0; 
__pyx_t_6 = -1; 
if (__pyx_t_12 < 0) { 
    __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape; 
    if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0; 
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0; 
if (unlikely(__pyx_t_6 != -1)) { 
    __Pyx_RaiseBufferIndexError(__pyx_t_6); 
    {__pyx_filename = __pyx_f[0]; __pyx_lineno = 13; __pyx_clineno = __LINE__; goto __pyx_L1_error;} 
} 
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i; 

With the improvements above:

__pyx_t_8 = 0; 
*((double *) (/* dim=0 */ ((char *) (((double *) __pyx_v_sum_arr.data) + __pyx_t_8)))) += __pyx_v_i; 
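One practical note on the func2 above: it returns the memoryview rather than an ndarray. A memoryview is a view onto the same buffer, so it can be wrapped back into an array with np.asarray without copying. The sketch below uses Python's built-in memoryview as a stand-in for a Cython typed memoryview, on the assumption that both expose the same buffer to NumPy; it is not compiled Cython code.

```python
import numpy as np

sum_arr = np.zeros(3, dtype=np.float64)

# Stand-in for the typed memoryview: a view onto sum_arr's buffer.
view = memoryview(sum_arr)
view[0] = 1.5                  # writes go straight to the underlying buffer

# Wrap the same buffer as an ndarray again; no data is copied.
as_array = np.asarray(view)

print(as_array, np.shares_memory(sum_arr, as_array))
```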
+0

Thanks a lot! This indeed works, and I get the same speed as func1 and func2. Awesome! – Francisco