2015-10-05 80 views
-1

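Applying a Gaussian mask in a CUDA kernel using shared memory

I am trying to complete the assignments for Udacity's "Intro to Parallel Programming" course, and I am stuck on the second one, which basically applies a Gaussian blur mask to an image using CUDA. I want to do this efficiently by making use of shared memory. My idea for handling the "pixels at the border" problem is to launch more threads than there are active pixels in a block: for example, if I split the input image into blocks of 16x16 active pixels and I have a 9x9 mask, then my actual block dimension (for both x and y) is 16 + 2*(9/2) = 24. This way I launch 24 threads per dimension in each block, so that the "outer" threads are used only to load pixels from the input image into shared memory, while the "inner" threads correspond to the active pixels that actually perform the computation (in addition to caching their own pixel in shared memory).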

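For some reason it does not work. As you can see from the attached code, I am able to cache the pixels in shared memory, but something goes wrong during the computation. I am also attaching an image of the bad result I get.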

__global__ void gaussian_blur(const unsigned char* const inputChannel,
                              unsigned char* const outputChannel,
                              int numRows, int numCols,
                              const float* const filter, const int filterWidth)
{

int filter_radius = (int)(filterWidth/2); //getting the filter "radius" 

int x = blockDim.x*blockIdx.x+threadIdx.x; 
int y = blockDim.y*blockIdx.y+threadIdx.y; 

if(x>=(numCols+filter_radius) || y>=(numRows+filter_radius)) 
    return; 

int px = x-filter_radius; 
int py = y-filter_radius; 

//clamping 

if(px<0) px = 0; 
if(py<0) py = 0; 
//if(px>=numCols) px = numCols-1; 
// if(py>=numRows) py = numRows-1; 

__shared__ unsigned char tile[(16+8)*(16+8)]; //16 active pixels + 2*filter_radius 

tile[threadIdx.y*24+threadIdx.x] = inputChannel[py*numCols+px]; 

__syncthreads(); 

//Here everything is working fine: if I do 
// outputChannel[py*numCols+px] = tile[threadIdx.y*24+threadIdx.x]; 
//then I am able to see the perfect reconstruction of the input image. 

//caching the filter 
__shared__ float t_filter[81]; //9x9 conv mask 

if(threadIdx.x==0 && threadIdx.y==0) 
{ 
    for(int i=0; i<81; i++) 
     t_filter[i] = filter[i]; 
} 

__syncthreads(); 


//I am checking the threadIdx of the threads and I am performing the mask computation 
//only to those threads that are pointing to active pixels: 
//i.e. all the threads whose id is greater or equal to the filter radius, 
//but smaller than the whole block of active pixels will perform the computation. 
//filter_radius = filterWidth/2 = 9/2 = 4 
//blockDim.x or y = 16 + 2*filter_radius = 16+8 = 24 
//active pixel index limit = filter_radius+16 = 4+16 = 20 
//is that correct? 


if( 
    threadIdx.y>=filter_radius && threadIdx.x>=filter_radius && 
    threadIdx.x < 20 && threadIdx.y < 20 
) 
{ 

    float value = 0.0; 

    for(int i=-filter_radius; i<=filter_radius; i++) 
     for(int j=-filter_radius; j<=filter_radius; j++) 
     { 
      int fx = i+filter_radius; 
      int fy = j+filter_radius; 

      int ty = threadIdx.y+i; 
      int tx = threadIdx.x+j; 

      value += ((float)tile[ty*24+tx])*t_filter[fy*filterWidth+fx]; 
     } 
    outputChannel[py*numCols+px] = (unsigned char) value; 
} 
} // end of gaussian_blur

Output image: http://i.stack.imgur.com/EMu5M.png

Edit: adding the kernel call:

int filter_radius = (int) (filterWidth/2); 
    blockSize.x = 16 + 2*filter_radius; 
    blockSize.y = 16 + 2*filter_radius; 
    gridSize.x = numCols/16+1; 
    gridSize.y = numRows/16+1; 

    printf("\n grx %d gry %d \n", blockSize.x, blockSize.y); 

    gaussian_blur<<<gridSize, blockSize>>>(d_red, d_redBlurred, numRows,numCols, d_filter, filterWidth); 
    gaussian_blur<<<gridSize, blockSize>>>(d_green, d_greenBlurred, numRows,numCols, d_filter, filterWidth); 
    gaussian_blur<<<gridSize, blockSize>>>(d_blue, d_blueBlurred, numRows,numCols, d_filter, filterWidth); 

    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError()); 

    blockSize.x = 32; gridSize.x = numCols/32+1; 
    blockSize.y = 32; gridSize.y = numRows/32+1; 

    // Now we recombine your results. We take care of launching this kernel for you. 
    // 
    // NOTE: This kernel launch depends on the gridSize and blockSize variables, 
    // which you must set yourself. 
    recombineChannels<<<gridSize, blockSize>>>(d_redBlurred, 
              d_greenBlurred, 
              d_blueBlurred, 
              d_outputImageRGBA, 
              numRows, 
              numCols); 
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError()); 

Edit 2:

All the other code needed to compile and run this can be found here: https://github.com/udacity/cs344/tree/master/Problem%20Sets/Problem%20Set%202 The kernel above goes in the student_func.cu file.

+1

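From [here](http://stackoverflow.com/help/on-topic): "Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: [How to create a Minimal, Complete, and Verifiable example (MCVE)](http://stackoverflow.com/help/mcve)." A CUDA kernel by itself is not an MCVE. Preferably, your MCVE should be self-contained and should not require OpenCV or other frameworks, or separate data files. –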

+0

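Sorry, but I have browsed a number of CUDA questions here and none of them shows the whole thing. Some of them show the kernel call itself, but I am pretty sure that, when dealing with images, none of them provides its own functions for reading and writing image files in order to avoid OpenCV or other frameworks. I am adding the kernel call and posting a link to the other files needed to compile. I think that should be enough. As for what this code is supposed to do, I think that is well explained. – alef0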

Answer

0

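In your implementation, each block never computes the blur for its border pixels (those within one filter radius of the block edge). This means you want your blocks to overlap so that the borders are covered. If you look at the range of the x index in each block,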

int x = blockDim.x*blockIdx.x+threadIdx.x; 

then for the kernel launch shown above we have

blockIdx.x = 0: x = [0,23] 
blockIdx.x = 1: x = [24,47] 
... etc 

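As you can see, each block looks at a distinct part of your image, but you have told each block not to compute on its border. This means that the border of every block is skipped in the computation (hence the black grid in the image).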
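To make the gap concrete, here is a minimal host-side sketch (plain C++, not CUDA) that replays the posted index arithmetic in one dimension and records which image columns an active thread would write. The function name `columnsWritten` and its parameters are hypothetical and only serve this illustration; it assumes the same early-exit test and left clamping as the posted kernel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the original post): emulates in 1D which
// image columns the posted kernel writes. Each block has blockDim threads and
// only threads with radius <= t < radius + active write an output pixel.
std::vector<bool> columnsWritten(int numCols, int numBlocks, int blockDim,
                                 int active, int radius) {
    assert(numCols > 0 && active > 0 && radius >= 0);
    std::vector<bool> written(numCols, false);
    for (int b = 0; b < numBlocks; ++b) {
        for (int t = 0; t < blockDim; ++t) {
            int x = blockDim * b + t;              // index as in the posted kernel
            if (x >= numCols + radius) continue;   // same early-exit test as the kernel
            int px = x - radius;                   // pixel this thread maps to
            if (px < 0) px = 0;                    // left clamping as in the kernel
            bool isActive = (t >= radius) && (t < radius + active);
            if (isActive) written[px] = true;      // only active threads store a result
        }
    }
    return written;
}
```

For a 48-column image with 24-thread blocks, 16 active pixels and radius 4, calling `columnsWritten(48, 4, 24, 16, 4)` marks columns 0 to 15 and 24 to 39 as written and leaves 16 to 23 and 40 to 47 untouched, which corresponds to the dark stripes of the grid.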

You need to compute your index with something like

int x = (blockDim.x-2*filter_radius)*blockIdx.x+threadIdx.x; 

so that the blocks overlap. Now the ranges of our x index look like

blockIdx.x = 0: x = [0,23] 
blockIdx.x = 1: x = [16,39] 
... etc
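
As a quick sanity check of the overlapping index, the following host-side sketch (plain C++, not CUDA) counts how many image columns are left unwritten when the stride between blocks is `blockDim - 2*radius`. The name `countUncoveredColumns` is hypothetical, the grid size uses a ceiling division, and the `px < numCols` test for the last block is an assumption of this sketch rather than something stated in the post.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the original post): emulates in 1D the
// overlapping index x = (blockDim - 2*radius) * blockIdx + threadIdx and
// returns how many image columns no active thread writes.
int countUncoveredColumns(int numCols, int active, int radius) {
    assert(numCols > 0 && active > 0 && radius >= 0);
    const int blockDim = active + 2 * radius;               // e.g. 16 + 2*4 = 24
    const int stride = blockDim - 2 * radius;               // equals the active width
    const int numBlocks = (numCols + active - 1) / active;  // ceiling division
    std::vector<bool> written(numCols, false);
    for (int b = 0; b < numBlocks; ++b) {
        for (int t = 0; t < blockDim; ++t) {
            int x = stride * b + t;   // overlapping index suggested above
            int px = x - radius;      // pixel this thread maps to
            bool isActive = (t >= radius) && (t < radius + active);
            // Active threads always have px >= 0; the upper bound guards the
            // last block when numCols is not a multiple of the active width.
            if (isActive && px < numCols) written[px] = true;
        }
    }
    int uncovered = 0;
    for (bool w : written) {
        if (!w) ++uncovered;
    }
    return uncovered;
}
```

With this mapping the active threads of block b cover columns 16*b to 16*b + 15, so `countUncoveredColumns(48, 16, 4)` returns 0. In the actual kernel, the halo threads at the right and bottom image edges would presumably also need their load coordinates clamped (as in the commented-out lines of the posted code) so that the tile is fully populated; that part is not modeled here.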