2017-10-11 99 views

Can't get the same values as numpy element-wise matrix multiplication using numba

I have been playing around with numba, trying to implement a simple element-wise matrix multiplication. When I use 'vectorize' I get the same results as the numpy multiplication, but when I use 'cuda.jit' the results do not match: many of the elements are zero. I have included a minimal working example below. Any help with this problem would be appreciated. I am using numba 0.35.0 and Python 2.7.

from __future__ import division 
from __future__ import print_function 

import numpy as np 

from numba import vectorize, cuda, jit 

M = 80 
N = 40 
P = 40 

# Set the number of threads in a block 
threadsperblock = 32 

# Calculate the number of thread blocks in the grid 
blockspergrid = (M*N*P + (threadsperblock - 1)) // threadsperblock 

@vectorize(['float32(float32,float32)'], target='cuda') 
def VectorMult3d(a, b): 
    return a*b 

@cuda.jit('void(float32[:, :, :], float32[:, :, :], float32[:, :, :])') 
def mult_gpu_3d(a, b, c): 
    [x, y, z] = cuda.grid(3) 
    if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]: 
        c[x, y, z] = a[x, y, z] * b[x, y, z] 

if __name__ == '__main__': 
    A = np.random.normal(size=(M, N, P)).astype(np.float32) 
    B = np.random.normal(size=(M, N, P)).astype(np.float32) 

    numpy_C = A*B 

    A_gpu = cuda.to_device(A) 
    B_gpu = cuda.to_device(B) 
    C_gpu = cuda.device_array((M,N,P), dtype=np.float32) # cuda.device_array_like(A_gpu) 

    mult_gpu_3d[blockspergrid,threadsperblock](A_gpu,B_gpu,C_gpu) 

    cudajit_C = C_gpu.copy_to_host() 

    print('------- using cuda.jit -------') 
    print('Is close?: {}'.format(np.allclose(numpy_C,cudajit_C))) 
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,cudajit_C)), M*N*P)) 
    print('------- using cuda.jit -------\n') 


    vectorize_C_gpu = VectorMult3d(A_gpu, B_gpu) 
    vectorize_C = vectorize_C_gpu.copy_to_host() 

    print('------- using vectorize -------') 
    print('Is close?: {}'.format(np.allclose(numpy_C,vectorize_C))) 
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,vectorize_C)), M*N*P)) 
    print('------- using vectorize -------\n') 

    import numba; print("numba version: "+numba.__version__) 

Answer

Here is how you could debug this.

Consider a smaller and simplified example:

  • reduce the array sizes, e.g. (2, 3, 1), so you can actually print the values and read them
  • use simple, deterministic contents, e.g. "all ones", to compare across runs
  • add extra kernel arguments for debugging

from __future__ import (division, print_function) 

import numpy as np 
from numba import cuda 

M = 2 
N = 3 
P = 1 

threadsperblock = 1 
blockspergrid = (M * N * P + (threadsperblock - 1)) // threadsperblock 


@cuda.jit 
def mult_gpu_3d(a, b, c, grid_ran, grid_multed): 
    grid = cuda.grid(3) 
    x, y, z = grid 

    grid_ran[x] = 1 

    if (x < c.shape[0]) and (y < c.shape[1]) and (z < c.shape[2]): 
        grid_multed[x] = 1 
        c[grid] = a[grid] * b[grid] 


if __name__ == '__main__': 
    A = np.ones((M, N, P), np.int32) 
    B = np.ones((M, N, P), np.int32) 

    A_gpu = cuda.to_device(A) 
    B_gpu = cuda.to_device(B) 
    C_gpu = cuda.to_device(np.zeros_like(A)) 

    # Tells whether the thread at index i has run 
    grid_ran = cuda.to_device(np.zeros([blockspergrid], np.int32)) 

    # Tells whether the thread at index i has performed the multiplication 
    grid_multed = cuda.to_device(np.zeros(blockspergrid, np.int32)) 

    mult_gpu_3d[blockspergrid, threadsperblock](
     A_gpu, B_gpu, C_gpu, grid_ran, grid_multed) 

    print("grid_ran.shape : ", grid_ran.shape) 
    print("grid_multed.shape : ", grid_multed.shape) 
    print("C_gpu.shape  : ", C_gpu.shape) 

    print("grid_ran   : ", grid_ran.copy_to_host()) 
    print("grid_multed  : ", grid_multed.copy_to_host()) 

    C = C_gpu.copy_to_host() 
    print("C transpose flat : ", C.T.flatten()) 
    print("C     : \n", C) 

Output:

grid_ran.shape : (6,) 
grid_multed.shape : (6,) 
C_gpu.shape  : (2, 3, 1) 
grid_ran   : [1 1 1 1 1 1] 
grid_multed  : [1 1 0 0 0 0] 
C transpose flat : [1 1 0 0 0 0] 
C     : 
 [[[1]
  [0]
  [0]]

 [[1]
  [0]
  [0]]]

You can see that the device grid shape does not match the shape of the arrays: the grid is flat (M*N*P), while the arrays are all 3-dimensional (M, N, P). That is, the first dimension of the grid has indices in the range 0..M*N*P-1 (0..5, 6 values in total in this example), while the first dimension of an array only goes from 0 to M-1 (0..1, 2 values in total in my example). This mistake would normally lead to out-of-bounds access, but you protected your kernel with a conditional, which cuts down the offending threads:

if x < c.shape[0] and y < c.shape[1] and z < c.shape[2] 

This line does not allow the threads with indices above M-1 (above 1 in my example) to run (well, sort of [1]), which is why no values are written there and you get so many zeros in the resulting array.
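The effect of that guard can be reproduced on the host with plain Python (a sketch of the simplified example above: M=2, N=3, P=1, one thread per block, so the 1D launch makes cuda.grid(3) return (block index, 0, 0) in each thread):

```python
M, N, P = 2, 3, 1
blockspergrid = M * N * P  # 6 blocks of 1 thread each: a flat 1D grid

# For a 1D launch, only the x component of cuda.grid(3) varies;
# y and z are always 0 for every thread.
ran, multed = [], []
for x in range(blockspergrid):
    y = z = 0
    ran.append(1)  # every thread runs
    # but only threads with x < M pass the guard
    multed.append(1 if (x < M and y < N and z < P) else 0)

print(ran)     # [1, 1, 1, 1, 1, 1]
print(multed)  # [1, 1, 0, 0, 0, 0]
```

This matches the grid_ran and grid_multed output above: all 6 threads run, but only the 2 threads whose flat index happens to be below M write anything.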

Possible solutions:

  • In general, you could use a multi-dimensional kernel grid configuration, i.e. a 3D vector for blockspergrid instead of a scalar [2].
  • In particular, since element-wise multiplication is a map operation and does not depend on the array shapes, you could flatten all 3 arrays into 1D arrays, run your kernel over a 1D grid, then reshape the result back [3], [4].

References:


Thanks, your explanation is very clear. I took the suggestion of using a multi-dimensional kernel grid configuration, with something like the following: 'threadsperblock = (4, 4, 4); blockspergrid_x = np.int(np.ceil(M/threadsperblock[0]))', setting blockspergrid_y and blockspergrid_z in the same way, then 'blockspergrid = (blockspergrid_x, blockspergrid_y, blockspergrid_z)'. Finally I call 'mult_gpu_3d' with 'blockspergrid' and 'threadsperblock'. The references you provided are also great! Thanks again. –