Rewriting a for loop in pure NumPy to decrease execution time

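I recently asked about trying to optimise a Python loop for a scientific application, and received an excellent, smart way of recoding it within NumPy which reduced execution time by a factor of around 100!

However, the calculation of the B value is actually nested within a few other loops, because it is evaluated on a regular grid of positions. Is there a similarly smart NumPy rewrite to shave time off this procedure?

I suspect the performance gain for this part would be less marked, and the disadvantages would presumably be that it would not be possible to report the progress of the calculation to the user, that the results could not be written to the output file until the end of the calculation, and possibly that doing this in one enormous step would have memory implications. Is it possible to circumvent any of these?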

import numpy as np
import time

def reshape_vector(v):
    # Copy a length-3 vector into a (3, 1) column
    b = np.empty((3, 1))
    for i in range(3):
        b[i][0] = v[i]
    return b

def unit_vectors(r):
    # Normalise each column of r to unit length
    return r / np.sqrt((r * r).sum(0))

def calculate_dipole(mu, r_i, mom_i):
    # Sum the dipole fields of all N dipoles at the single point mu
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    A = 1e-7

    num = A * (3 * np.sum(mom_i * r_unit, 0) * r_unit - mom_i)
    den = np.sqrt(np.sum(relative * relative, 0)) ** 3
    B = np.sum(num / den, 1)
    return B

N = 20000 # number of dipoles
r_i = np.random.random((3, N)) # positions of dipoles
mom_i = np.random.random((3, N)) # moments of dipoles
a = np.random.random((3, 3)) # three basis vectors for this crystal
n = [10, 10, 10] # points at which to evaluate sum
gamma_mu = 135.5 # a constant

t_start = time.perf_counter()
for i in range(n[0]):
    r_frac_x = float(i) / float(n[0])
    r_test_x = r_frac_x * a[0]
    for j in range(n[1]):
        r_frac_y = float(j) / float(n[1])
        r_test_y = r_frac_y * a[1]
        for k in range(n[2]):
            r_frac_z = float(k) / float(n[2])
            r_test = r_test_x + r_test_y + r_frac_z * a[2]
            r_test_fast = reshape_vector(r_test)
            B = calculate_dipole(r_test_fast, r_i, mom_i)
            omega = gamma_mu * np.sqrt(np.dot(B, B))
            # write r_test, B and omega to a file
    frac_done = float(i + 1) / (n[0] + 1)
    t_elapsed = time.perf_counter() - t_start
    t_remain = (1 - frac_done) * t_elapsed / frac_done
    print(frac_done * 100, '% done in', t_elapsed / 60., 'minutes...approximately', t_remain / 60., 'minutes remaining')

Source

2010-04-07 Statto

One obvious thing you can do is replace the line

r_test_fast = reshape_vector(r_test)

with

r_test_fast = r_test.reshape((3,1))

It probably won't make any big difference in performance, but in any case it makes sense to use the numpy builtin instead of reinventing the wheel.
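
As a quick check, the loop-based helper and the builtin reshape produce the same (3, 1) column. This is a minimal sketch; the example vector values are made up for illustration.

```python
import numpy as np

def reshape_vector(v):
    # Loop-based helper from the question, kept here for comparison
    b = np.empty((3, 1))
    for i in range(3):
        b[i][0] = v[i]
    return b

r_test = np.array([0.1, 0.2, 0.3])    # arbitrary example vector
col_loop = reshape_vector(r_test)
col_builtin = r_test.reshape((3, 1))  # the builtin replacement
col_newaxis = r_test[:, np.newaxis]   # equivalent spelling via indexing

print(col_builtin.shape)
print(np.array_equal(col_loop, col_builtin))
print(np.array_equal(col_loop, col_newaxis))
```

Note that reshape returns a view where possible, so no per-element copying happens in Python.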

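Generally speaking, as you have probably noticed by now, the trick to optimizing numpy code is to express the algorithm with numpy whole-array operations, or at least with slices, instead of iterating over each element in Python code. What tends to prevent this kind of "vectorization" is so-called loop-carried dependencies, i.e. loops where each iteration depends on the result of a previous iteration. Looking briefly at your code, you have no such thing, so it should be possible to vectorize it.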
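
To illustrate the distinction, here is a small sketch contrasting a loop with independent iterations, which maps directly onto a whole-array operation, with a loop that carries a dependency between iterations. The arrays and the recurrence are made up for the example and are not taken from the question.

```python
import numpy as np

x = np.arange(1.0, 6.0)  # [1, 2, 3, 4, 5]

# Independent iterations: each output depends only on its own input,
# so the loop can be replaced by a single whole-array expression.
squares_loop = np.empty_like(x)
for i in range(len(x)):
    squares_loop[i] = x[i] ** 2
squares_vec = x ** 2

# Loop-carried dependency: iteration i needs the result of iteration i-1,
# so there is no direct elementwise rewrite of this recurrence.
y = np.empty_like(x)
y[0] = x[0]
for i in range(1, len(x)):
    y[i] = 0.5 * y[i - 1] + x[i]

print(np.allclose(squares_loop, squares_vec))
print(y)
```

The grid loops in the question are of the first kind: the field at one grid point does not depend on the field at any other.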

EDIT: One solution

I haven't verified that this is correct, but it should give you an idea of how to approach it.
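
The solution that follows calls a cartesian() helper whose definition is not reproduced in this thread. As a stand-in, here is a minimal sketch that assumes the helper should return every combination of the input 1-D arrays as rows, with the last input varying fastest (matching the i, j, k nesting of the original loops). Building it on np.meshgrid is an assumption for illustration, not necessarily how the referenced function is written.

```python
import numpy as np

def cartesian(arrays):
    """Cartesian product of 1-D arrays, one combination per row.

    Hypothetical stand-in for the helper the answer refers to; the last
    input varies fastest, like the innermost loop in the question.
    """
    grids = np.meshgrid(*arrays, indexing='ij')
    return np.stack([g.ravel() for g in grids], axis=-1)

# Small example: a 2 x 2 x 3 grid of fractional coordinates
points = cartesian((np.arange(2) / 2.0,
                    np.arange(2) / 2.0,
                    np.arange(3) / 3.0))
print(points.shape)
print(points[:4])
```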

First, take the cartesian() function, which we'll use. Then

 

def calculate_dipole_vect(mus, r_i, mom_i):
    # Treat each mu sequentially; r_i and mom_i have shape (N, 3) here
    Bs = []
    omega = []
    for mu in mus:
        rel = mu - r_i
        r_norm = np.sqrt((rel * rel).sum(1))
        r_unit = rel / r_norm[:, np.newaxis]
        A = 1e-7

        # Per-dipole dot product: sum over the component axis (axis 1),
        # since the arrays are transposed relative to the question's code
        m_dot_r = np.sum(mom_i * r_unit, 1)[:, np.newaxis]
        num = A * (3 * m_dot_r * r_unit - mom_i)
        den = r_norm ** 3
        B = np.sum(num / den[:, np.newaxis], 0)
        Bs.append(B)
        omega.append(gamma_mu * np.sqrt(np.dot(B, B)))
    return Bs, omega


# Transpose to get more "natural" ordering with row-major numpy
r_i = r_i.T
mom_i = mom_i.T

t_start = time.perf_counter()
r_frac = cartesian((np.arange(n[0]) / float(n[0]),
                    np.arange(n[1]) / float(n[1]),
                    np.arange(n[2]) / float(n[2])))
r_test = np.dot(r_frac, a)
B, omega = calculate_dipole_vect(r_test, r_i, mom_i)

print('Total time for vectorized: %f s' % (time.perf_counter() - t_start))

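Well, in my tests, this vectorized version is actually slightly slower than the loop-based approach I started from. The thing is, the original version in the question was already vectorized with whole-array operations over arrays of shape (20000, 3), so any further vectorization doesn't bring much additional benefit. In fact, as seen above, it may worsen performance, possibly because of large temporary arrays.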
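
One way to keep the temporaries bounded while still batching several grid points per numpy call is to process the points in chunks, which also leaves a natural place to report progress or write results. The sketch below assumes the transposed (N, 3) layout used in the code above; the function name calculate_dipole_chunked and the chunk parameter are hypothetical, the test data is random, and it has not been benchmarked, so whether it beats the plain loop would need measuring.

```python
import numpy as np

def calculate_dipole_chunked(mus, r_i, mom_i, chunk=16, A=1e-7):
    """Field B at each row of mus (M, 3) from dipoles r_i, mom_i (N, 3).

    Hypothetical helper: handles `chunk` grid points at a time, so the
    largest temporary has shape (chunk, N, 3) rather than (M, N, 3).
    """
    B = np.empty((len(mus), 3))
    for start in range(0, len(mus), chunk):
        block = mus[start:start + chunk]                  # (c, 3)
        rel = block[:, np.newaxis, :] - r_i               # (c, N, 3)
        r_norm = np.sqrt((rel * rel).sum(-1))             # (c, N)
        r_unit = rel / r_norm[..., np.newaxis]            # (c, N, 3)
        m_dot_r = (mom_i * r_unit).sum(-1)                # (c, N)
        num = A * (3 * m_dot_r[..., np.newaxis] * r_unit - mom_i)
        B[start:start + chunk] = (num / r_norm[..., np.newaxis] ** 3).sum(1)
        # Progress can be reported (or results written out) once per chunk
        print('%d of %d points done' % (min(start + chunk, len(mus)), len(mus)))
    return B

rng = np.random.default_rng(0)
r_i = rng.random((200, 3))         # dipole positions, already (N, 3)
mom_i = rng.random((200, 3))       # dipole moments, already (N, 3)
mus = rng.random((40, 3)) + 2.0    # test points kept away from the dipoles
B = calculate_dipole_chunked(mus, r_i, mom_i)
omega = 135.5 * np.sqrt((B * B).sum(1))  # gamma_mu value from the question
```

The chunk size trades memory for fewer Python-level iterations: with N = 20000 dipoles, each (chunk, N, 3) float64 temporary takes chunk × 480 kB.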

Source

2010-04-07 13:38:17 janneb

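I think Justin's suggestion to profile is probably the wise one, but thank you very much... although I'm not sure I'll use it, I think trying to understand this example will be a very good way to learn. :) – Statto 2010-04-07 16:10:52

If you profile your code, you'll see that 99% of the running time is spent in calculate_dipole, so reducing the time spent in this looping won't give a noticeable reduction in execution time. If you want to make this faster, you still need to focus on calculate_dipole. I tried my Cython code for calculate_dipole on this and got a reduction of about a factor of 2 in the overall time. There may be other ways to improve the Cython code as well.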
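
For readers who want to repeat that measurement, the standard-library cProfile and pstats modules are enough. The following is a minimal sketch, not the profiling run described above: the run_grid wrapper is hypothetical, and the grid and dipole count are scaled down so it finishes quickly.

```python
import cProfile
import io
import pstats

import numpy as np

def calculate_dipole(mu, r_i, mom_i):
    # Same computation as in the question, with mu of shape (3, 1)
    relative = mu - r_i
    dist = np.sqrt(np.sum(relative * relative, 0))
    r_unit = relative / dist
    num = 1e-7 * (3 * np.sum(mom_i * r_unit, 0) * r_unit - mom_i)
    return np.sum(num / dist ** 3, 1)

def run_grid(n=4, N=2000):
    # Hypothetical wrapper around the nested loops, scaled down for profiling
    r_i = np.random.random((3, N))
    mom_i = np.random.random((3, N))
    a = np.random.random((3, 3))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                r_test = (i * a[0] + j * a[1] + k * a[2]) / n
                calculate_dipole(r_test.reshape((3, 1)), r_i, mom_i)

profiler = cProfile.Profile()
profiler.enable()
run_grid()
profiler.disable()

# Collect the ten entries with the largest cumulative time into a string
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats(10)
print(report.getvalue())
```

The cumulative-time column shows how much of the total is spent inside calculate_dipole and the numpy calls it makes, compared with the loop bookkeeping around it.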

Source

2010-04-07 15:18:37
