Unable to understand the behavior of CUDA kernel launches

#include "utils.h" 

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    for (size_t r = 0; r < numRows; ++r) {
        for (size_t c = 0; c < numCols; ++c) {
            uchar4 rgba = rgbaImage[r * numCols + c];
            float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
            greyImage[r * numCols + c] = channelSum;
        }
    }
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    const dim3 blockSize(1, 1, 1); //TODO
    const dim3 gridSize(1, 1, 1); //TODO
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);

    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}

This is code for converting a color image to greyscale. I am working on an assignment for a course and got the following results after completing it.

A. 
blockSize = (1, 1, 1) 
gridSize = (1, 1, 1) 
Your code ran in: 34.772705 msecs. 

B. 
blockSize = (numCols, 1, 1) 
gridSize = (numRows, 1, 1) 
Your code ran in: 1821.326416 msecs. 

C. 
blockSize = (numRows, 1, 1) 
gridSize = (numCols, 1, 1) 
Your code ran in: 1695.917480 msecs. 

D. 
blockSize = (1024, 1, 1) 
gridSize = (170, 1, 1) [the image size is : r=313, c=557, blockSize*gridSize ~= r*c] 
Your code ran in: 1709.109863 msecs.

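I have tried several more combinations, but none of them performs better than A. The closest I got, within a few hundredths of a millisecond of A, was by increasing the block size and grid size by small values. For example: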

blockSize = (10, 1, 1) 
gridSize = (10, 1, 1) 
Your code ran in: 34.835167 msecs.

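I do not understand why larger numbers do not give better performance and instead make it worse. Also, it seems that increasing the block size is better than increasing the grid size.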

Source

2017-01-01 darth vader

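Every thread you launch computes all of the pixels, i.e. the kernel is completely serial. Using more blocks or larger blocks only repeats the same calculation in every extra thread. In those larger configurations, why not move the for loops out of the kernel and have each thread compute a single pixel?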
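The suggestion above can be sketched as follows. This is only a minimal illustration, not the course's reference solution: the names `rgba_to_greyscale_per_pixel` and `launch_rgba_to_greyscale` and the 16 x 16 block shape are assumptions made for the example, and the launcher returns a `cudaError_t` instead of using `checkCudaErrors` from `utils.h` so that the block stands on its own.

```cuda
#include <cuda_runtime.h>

// One thread per pixel: each thread derives its own (row, column) from its
// block and thread indices, so the two loops of the original kernel are gone.
__global__
void rgba_to_greyscale_per_pixel(const uchar4* const rgbaImage,
                                 unsigned char* const greyImage,
                                 int numRows, int numCols)
{
    const int c = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    const int r = blockIdx.y * blockDim.y + threadIdx.y;  // row index

    // The grid is rounded up to whole blocks, so threads that fall outside
    // the image must return without touching memory.
    if (r >= numRows || c >= numCols) return;

    const uchar4 rgba = rgbaImage[r * numCols + c];
    const float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
    greyImage[r * numCols + c] = static_cast<unsigned char>(channelSum);
}

// Hypothetical launcher (the name and the 16 x 16 block shape are assumptions).
// Both pointers must refer to device memory.
cudaError_t launch_rgba_to_greyscale(const uchar4* const d_rgbaImage,
                                     unsigned char* const d_greyImage,
                                     size_t numRows, size_t numCols)
{
    const dim3 blockSize(16, 16, 1);
    // Ceiling division: enough blocks to cover every column and every row.
    const dim3 gridSize(static_cast<unsigned int>((numCols + blockSize.x - 1) / blockSize.x),
                        static_cast<unsigned int>((numRows + blockSize.y - 1) / blockSize.y),
                        1);

    rgba_to_greyscale_per_pixel<<<gridSize, blockSize>>>(
        d_rgbaImage, d_greyImage,
        static_cast<int>(numRows), static_cast<int>(numCols));

    // Report a launch error first, otherwise any error raised during execution.
    const cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess) return launchErr;
    return cudaDeviceSynchronize();
}
```

For the 313 x 557 image from the question, this launches a 35 x 20 grid of 16 x 16 blocks, and each of the 174,341 pixels is written exactly once. In configuration B, by contrast, 174,341 threads each loop over all 174,341 pixels.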

Source

2017-01-01 11:06:09
