2017-10-11 99 views

Can't get the same values as numpy element-wise matrix multiplication using numba

I have been playing around with numba, trying to implement a simple element-wise matrix multiplication. When I use 'vectorize' I get the same results as the numpy multiplication, but when I use 'cuda.jit' the results do not match: many of the elements are zero. I have included a minimal working example below. Any help with this problem would be appreciated. I am using numba 0.35.0 and Python 2.7.

from __future__ import division 
from __future__ import print_function 

import numpy as np 

from numba import vectorize, cuda, jit 

M = 80 
N = 40 
P = 40 

# Set the number of threads in a block 
threadsperblock = 32 

# Calculate the number of thread blocks in the grid 
blockspergrid = (M*N*P + (threadsperblock - 1)) // threadsperblock 

@vectorize(['float32(float32,float32)'], target='cuda') 
def VectorMult3d(a, b): 
    return a*b 

@cuda.jit('void(float32[:, :, :], float32[:, :, :], float32[:, :, :])') 
def mult_gpu_3d(a, b, c): 
    [x, y, z] = cuda.grid(3) 
    if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]: 
        c[x, y, z] = a[x, y, z] * b[x, y, z] 

if __name__ == '__main__': 
    A = np.random.normal(size=(M, N, P)).astype(np.float32) 
    B = np.random.normal(size=(M, N, P)).astype(np.float32) 

    numpy_C = A*B 

    A_gpu = cuda.to_device(A) 
    B_gpu = cuda.to_device(B) 
    C_gpu = cuda.device_array((M,N,P), dtype=np.float32) # cuda.device_array_like(A_gpu) 

    mult_gpu_3d[blockspergrid,threadsperblock](A_gpu,B_gpu,C_gpu) 

    cudajit_C = C_gpu.copy_to_host() 

    print('------- using cuda.jit -------') 
    print('Is close?: {}'.format(np.allclose(numpy_C,cudajit_C))) 
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,cudajit_C)), M*N*P)) 
    print('------- using cuda.jit -------\n') 


    vectorize_C_gpu = VectorMult3d(A_gpu, B_gpu) 
    vectorize_C = vectorize_C_gpu.copy_to_host() 

    print('------- using vectorize -------') 
    print('Is close?: {}'.format(np.allclose(numpy_C,vectorize_C))) 
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,vectorize_C)), M*N*P)) 
    print('------- using vectorize -------\n') 

    import numba; print("numba version: "+numba.__version__) 

Answer

Here is how you could debug this.

Consider a smaller and simplified example:

  • reduce the array sizes, e.g. (2, 3, 1), so you can actually print the values and read them
  • use simple, deterministic contents, e.g. "all ones", to compare across runs
  • add extra kernel arguments for debugging

from __future__ import (division, print_function) 

import numpy as np 
from numba import cuda 

M = 2 
N = 3 
P = 1 

threadsperblock = 1 
blockspergrid = (M * N * P + (threadsperblock - 1)) // threadsperblock 


@cuda.jit 
def mult_gpu_3d(a, b, c, grid_ran, grid_multed): 
    grid = cuda.grid(3) 
    x, y, z = grid 

    grid_ran[x] = 1 

    if (x < c.shape[0]) and (y < c.shape[1]) and (z < c.shape[2]): 
        grid_multed[x] = 1 
        c[grid] = a[grid] * b[grid] 


if __name__ == '__main__': 
    A = np.ones((M, N, P), np.int32) 
    B = np.ones((M, N, P), np.int32) 

    A_gpu = cuda.to_device(A) 
    B_gpu = cuda.to_device(B) 
    C_gpu = cuda.to_device(np.zeros_like(A)) 

    # Tells whether the thread at index i has run 
    grid_ran = cuda.to_device(np.zeros([blockspergrid], np.int32)) 

    # Tells whether the thread at index i has performed the multiplication 
    grid_multed = cuda.to_device(np.zeros(blockspergrid, np.int32)) 

    mult_gpu_3d[blockspergrid, threadsperblock](
     A_gpu, B_gpu, C_gpu, grid_ran, grid_multed) 

    print("grid_ran.shape : ", grid_ran.shape) 
    print("grid_multed.shape : ", grid_multed.shape) 
    print("C_gpu.shape  : ", C_gpu.shape) 

    print("grid_ran   : ", grid_ran.copy_to_host()) 
    print("grid_multed  : ", grid_multed.copy_to_host()) 

    C = C_gpu.copy_to_host() 
    print("C transpose flat : ", C.T.flatten()) 
    print("C     : \n", C) 

Output:

grid_ran.shape : (6,) 
grid_multed.shape : (6,) 
C_gpu.shape  : (2, 3, 1) 
grid_ran   : [1 1 1 1 1 1] 
grid_multed  : [1 1 0 0 0 0] 
C transpose flat : [1 1 0 0 0 0] 
C     : 
 [[[1]
  [0]
  [0]]

 [[1]
  [0]
  [0]]]

You can see that the device grid shape does not match the shape of the arrays: the grid is flat (M*N*P), while the arrays are all 3-dimensional (M, N, P). That is, the first dimension of the grid has indices in the range 0..M*N*P-1 (0..5, 6 values in total in this example), while the first dimension of an array only goes from 0 to M-1 (0..1, 2 values in total in my example). This mistake would normally lead to out-of-bounds access, but you protected your kernel with a conditional, which cuts down the offending threads:

if x < c.shape[0] and y < c.shape[1] and z < c.shape[2] 

This line does not allow the threads with indices above M-1 (above 1 in my example) to run (well, sort of [1]), which is why no values are written there and you get so many zeros in the resulting array.
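The effect of that guard can be reproduced on the host with plain Python (a sketch of the simplified example above: M=2, N=3, P=1, one thread per block, so the 1D launch makes cuda.grid(3) return (block index, 0, 0) in each thread):

```python
M, N, P = 2, 3, 1
blockspergrid = M * N * P  # 6 blocks of 1 thread each: a flat 1D grid

# For a 1D launch, only the x component of cuda.grid(3) varies;
# y and z are always 0 for every thread.
ran, multed = [], []
for x in range(blockspergrid):
    y = z = 0
    ran.append(1)  # every thread runs
    # but only threads with x < M pass the guard
    multed.append(1 if (x < M and y < N and z < P) else 0)

print(ran)     # [1, 1, 1, 1, 1, 1]
print(multed)  # [1, 1, 0, 0, 0, 0]
```

This matches the grid_ran and grid_multed output above: all 6 threads run, but only the 2 threads whose flat index happens to be below M write anything.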

Possible solutions:

  • In general, you could use a multi-dimensional kernel grid configuration, i.e. a 3D vector for blockspergrid instead of a scalar [2].
  • In particular, since element-wise multiplication is a map operation and does not depend on the array shapes, you could flatten all 3 arrays into 1D arrays, run your kernel over a 1D grid, then reshape the result back [3], [4].

References:


Thanks, your explanation is very clear. I took the suggestion of using a multi-dimensional kernel grid configuration, with something like the following: 'threadsperblock = (4, 4, 4); blockspergrid_x = np.int(np.ceil(M/threadsperblock[0]))', setting blockspergrid_y and blockspergrid_z in the same way, then 'blockspergrid = (blockspergrid_x, blockspergrid_y, blockspergrid_z)'. Finally I call 'mult_gpu_3d' with 'blockspergrid' and 'threadsperblock'. The references you provided are also great! Thanks again. –