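Unable to get the same values as numpy element-wise matrix multiplication using numba

I have been playing around with numba and trying to implement a simple element-wise matrix multiplication. When using 'vectorize' I get the same result as the numpy multiplication, but when I use 'cuda.jit' the results are not the same: many of the elements are zero. I am providing a minimal working example below. Any help with the problem would be greatly appreciated. I am using numba 0.35.0 and Python 2.7.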
from __future__ import division
from __future__ import print_function

import numpy as np
from numba import vectorize, cuda, jit

M = 80
N = 40
P = 40

# Set the number of threads in a block
threadsperblock = 32

# Calculate the number of thread blocks in the grid
blockspergrid = (M*N*P + (threadsperblock - 1)) // threadsperblock

@vectorize(['float32(float32,float32)'], target='cuda')
def VectorMult3d(a, b):
    return a*b

@cuda.jit('void(float32[:, :, :], float32[:, :, :], float32[:, :, :])')
def mult_gpu_3d(a, b, c):
    [x, y, z] = cuda.grid(3)
    if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]:
        c[x, y, z] = a[x, y, z] * b[x, y, z]

if __name__ == '__main__':
    A = np.random.normal(size=(M, N, P)).astype(np.float32)
    B = np.random.normal(size=(M, N, P)).astype(np.float32)
    numpy_C = A*B

    A_gpu = cuda.to_device(A)
    B_gpu = cuda.to_device(B)
    C_gpu = cuda.device_array((M,N,P), dtype=np.float32) # cuda.device_array_like(A_gpu)

    mult_gpu_3d[blockspergrid,threadsperblock](A_gpu,B_gpu,C_gpu)
    cudajit_C = C_gpu.copy_to_host()

    print('------- using cuda.jit -------')
    print('Is close?: {}'.format(np.allclose(numpy_C,cudajit_C)))
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,cudajit_C)), M*N*P))
    print('------- using cuda.jit -------\n')

    vectorize_C_gpu = VectorMult3d(A_gpu, B_gpu)
    vectorize_C = vectorize_C_gpu.copy_to_host()

    print('------- using vectorize -------')
    print('Is close?: {}'.format(np.allclose(numpy_C,vectorize_C)))
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,vectorize_C)), M*N*P))
    print('------- using vectorize -------\n')

    import numba; print("numba version: "+numba.__version__)
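Thanks. Your explanation is very clear. I took the suggestion of using a multi-dimensional kernel grid configuration, something like the following: 'threadsperblock = (4, 4, 4); blockspergrid_x = np.int(np.ceil(M / threadsperblock[0]))'. I set blockspergrid_y and blockspergrid_z the same way, then 'blockspergrid = (blockspergrid_x, blockspergrid_y, blockspergrid_z)'. Finally, I call 'mult_gpu_3d' with 'blockspergrid' and 'threadsperblock'. The references you provided are also great! Thanks again.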
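
For readers who want to see the grid configuration from the comment above spelled out, here is a minimal sketch. The helper name `blocks_per_grid` is hypothetical (it is not part of numba), and the (4, 4, 4) block shape is simply the one mentioned in the comment, not a tuned value. The sketch only computes the launch dimensions in plain Python, so it runs without a GPU.

```python
from __future__ import division, print_function
import math

def blocks_per_grid(shape, threadsperblock):
    """Hypothetical helper: ceil-divide each array dimension by the
    matching block dimension so the grid covers the whole array."""
    return tuple(int(math.ceil(n / float(t)))
                 for n, t in zip(shape, threadsperblock))

M, N, P = 80, 40, 40
threadsperblock = (4, 4, 4)  # 64 threads per block, as in the comment
blockspergrid = blocks_per_grid((M, N, P), threadsperblock)
print(blockspergrid)  # (20, 10, 10)

# Sanity check: blocks * threads spans every dimension of the array.
covered = tuple(b * t for b, t in zip(blockspergrid, threadsperblock))
assert all(c >= n for c, n in zip(covered, (M, N, P)))
```

The kernel from the question would then be launched as 'mult_gpu_3d[blockspergrid, threadsperblock](A_gpu, B_gpu, C_gpu)'. As far as I understand it, the original launch passed scalar (1D) values for both arguments, in which case 'cuda.grid(3)' returns 0 for y and z in every thread, so only elements of the form c[x, 0, 0] are written. That would be consistent with most of the output never being computed, though I have not verified this against numba 0.35.0 specifically.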