2015-10-06 63 views
2

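CUDA dot product

I am working through a CUDA tutorial in which I have to compute the dot product of two vectors. After implementing the solution provided in the tutorial, I ran into some problems that were resolved in this Stack Overflow post. Now, no matter what I do, I get 0 as the answer. Below you can find the code!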

#include "cuda_runtime.h" 
#include "device_launch_parameters.h" 
#include "device_atomic_functions.h" 
#include <stdio.h> 
#include <stdlib.h> 
#define N (2048 * 8) 
#define THREADS_PER_BLOCK 512 

__global__ void dot(int *a, int *b, int *c) 
{ 
    __shared__ int temp[THREADS_PER_BLOCK]; 
    int index = threadIdx.x + blockIdx.x * blockDim.x; 
    temp[threadIdx.x] = a[index] * b[index]; 

    __syncthreads(); 

    if (threadIdx.x == 0)
    {
        int sum = 0;
        for (int i = 0; i < N; i++)
        {
            sum += temp[i];
        }
        atomicAdd(c, sum);
    }
} 

int main() 
{ 
    int *a, *b, *c; 
    int *dev_a, *dev_b, *dev_c; 
    int size = N * sizeof(int); 

    //allocate space for the variables on the device 
    cudaMalloc((void **)&dev_a, size); 
    cudaMalloc((void **)&dev_b, size); 
    cudaMalloc((void **)&dev_c, sizeof(int)); 

    //allocate space for the variables on the host 
    a = (int *)malloc(size); 
    b = (int *)malloc(size); 
    c = (int *)malloc(sizeof(int)); 

    //this is our ground truth 
    int sumTest = 0; 
    //generate numbers 
    for (int i = 0; i < N; i++)
    {
        a[i] = rand() % 10;
        b[i] = rand() % 10;
        sumTest += a[i] * b[i];
        printf(" %d %d \n", a[i], b[i]);
    }

    *c = 0; 

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice); 
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice); 
    cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice); 

    dot<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost); 

    printf("%d ", *c); 
    printf("%d ", sumTest); 

    free(a); 
    free(b); 
    free(c); 

    cudaFree(a); 
    cudaFree(b); 
    cudaFree(c); 

    system("pause"); 

    return 0; 

} 
+2

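If you add [error checking](http://stackoverflow.com/a/14038590/1231073) to your code, you will immediately find that you are copying extra memory into 'dev_c', and that inside the kernel your '__shared__' memory access goes out of bounds in the 'for' loop. – sgarizvi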

+0

You are right about the __shared__ memory, this (for (int i = 0; i

Answer

3

First, please add CUDA error checking to your code, as described in this legendary post.
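
As a rough illustration of what such error checking can look like, here is a minimal sketch. The names `cudaOk` and `CUDA_CHECK` are my own placeholders and are not taken from the linked post; only `cudaGetErrorString`, `cudaGetLastError` and `cudaDeviceSynchronize` are actual CUDA runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (name is an assumption): returns true on success;
// on failure prints the CUDA error string with its location and returns false.
static bool cudaOk(cudaError_t err, const char *file, int line)
{
    if (err == cudaSuccess)
    {
        return true;
    }
    fprintf(stderr, "CUDA error: %s (%s:%d)\n",
            cudaGetErrorString(err), file, line);
    return false;
}

// Hypothetical macro (name is an assumption): wrap a runtime call and
// stop the program as soon as it reports an error.
#define CUDA_CHECK(call)                            \
    do {                                            \
        if (!cudaOk((call), __FILE__, __LINE__))    \
        {                                           \
            exit(EXIT_FAILURE);                     \
        }                                           \
    } while (0)

// Typical use around the calls from the question:
//   CUDA_CHECK(cudaMemcpy(dev_c, c, sizeof(int), cudaMemcpyHostToDevice));
//   dot<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);
//   CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
//   CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel runs
```

With every runtime call wrapped this way, a failing call should be reported at the line where it occurs instead of silently leaving the result at 0.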

Just before the kernel launch, you are copying extra memory in the following line:

cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice); 

It should be:

cudaMemcpy(dev_c, c, sizeof(int), cudaMemcpyHostToDevice); 

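Another error in the code is that, inside the kernel, the __shared__ memory variable temp is accessed out of bounds in the for loop. The loop iterates up to N, whereas the shared memory array has only THREADS_PER_BLOCK elements. Simply replace N with THREADS_PER_BLOCK in the loop.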
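
Putting both fixes together, the kernel and the host side might look like the sketch below. The host wrapper `dotOnDevice`, the extra `n` parameter and the `index < n` guard are my own additions for illustration and are not part of the question's code. The sketch also passes the device pointers to cudaFree, whereas the question passes the host pointers a, b and c. It assumes n > 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 512

// Each thread stores one product in shared memory; thread 0 of every block
// then sums that block's THREADS_PER_BLOCK products and adds them to *c.
__global__ void dot(const int *a, const int *b, int *c, int n)
{
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    // Guard added for illustration: store 0 past the end of the vectors.
    temp[threadIdx.x] = (index < n) ? a[index] * b[index] : 0;

    __syncthreads();

    if (threadIdx.x == 0)
    {
        int sum = 0;
        // Loop bound is THREADS_PER_BLOCK (the size of temp), not N.
        for (int i = 0; i < THREADS_PER_BLOCK; i++)
        {
            sum += temp[i];
        }
        atomicAdd(c, sum);
    }
}

// Hypothetical host wrapper (an assumption, not from the question):
// writes the dot product to *result and returns false if a CUDA call fails.
bool dotOnDevice(const int *a, const int *b, int n, int *result)
{
    int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;
    size_t size = (size_t)n * sizeof(int);
    int c = 0;

    bool ok = cudaMalloc((void **)&dev_a, size) == cudaSuccess
           && cudaMalloc((void **)&dev_b, size) == cudaSuccess
           && cudaMalloc((void **)&dev_c, sizeof(int)) == cudaSuccess
           && cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice) == cudaSuccess
           // Copy exactly one int into dev_c, not `size` bytes.
           && cudaMemcpy(dev_c, &c, sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok)
    {
        int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
        dot<<<blocks, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c, n);
        // The blocking copy back also surfaces errors raised by the kernel.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(result, dev_c, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // Release the device allocations; cudaFree accepts a null pointer.
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return ok;
}
```

Calling `dotOnDevice(a, b, N, &value)` with the host arrays from the question should then give the same number as sumTest.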

+0

Yes, I see. Thank you! –