CUDA dot product

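I am working through a CUDA tutorial in which I have to compute the dot product of two vectors. After implementing the solution provided in the tutorial, I ran into some problems that were resolved in this Stack Overflow post. Now, no matter what I do, I get 0 as the answer. Below you can find the code!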

#include "cuda_runtime.h" 
#include "device_launch_parameters.h" 
#include "device_atomic_functions.h" 
#include <stdio.h> 
#include <stdlib.h> 
#define N (2048 * 8) 
#define THREADS_PER_BLOCK 512 

__global__ void dot(int *a, int *b, int *c) 
{ 
    __shared__ int temp[THREADS_PER_BLOCK]; 
    int index = threadIdx.x + blockIdx.x * blockDim.x; 
    temp[threadIdx.x] = a[index] * b[index]; 

    __syncthreads(); 

    if (threadIdx.x == 0) 
    { 
     int sum = 0; 
     for (int i = 0; i < N; i++) 
     { 
      sum += temp[i]; 
     } 
     atomicAdd(c, sum); 
    } 
} 

int main() 
{ 
    int *a, *b, *c; 
    int *dev_a, *dev_b, *dev_c; 
    int size = N * sizeof(int); 

    //allocate space for the variables on the device 
    cudaMalloc((void **)&dev_a, size); 
    cudaMalloc((void **)&dev_b, size); 
    cudaMalloc((void **)&dev_c, sizeof(int)); 

    //allocate space for the variables on the host 
    a = (int *)malloc(size); 
    b = (int *)malloc(size); 
    c = (int *)malloc(sizeof(int)); 

    //this is our ground truth 
    int sumTest = 0; 
    //generate numbers 
    for (int i = 0; i < N; i++) 
    { 
     a[i] = rand() % 10; 
     b[i] = rand() % 10; 
     sumTest += a[i] * b[i]; 
     printf(" %d %d \n",a[i],b[i]); 
    } 

    *c = 0; 

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice); 
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice); 
    cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice); 

    dot<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >> >(dev_a, dev_b, dev_c); 

    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost); 

    printf("%d ", *c); 
    printf("%d ", sumTest); 

    free(a); 
    free(b); 
    free(c); 

    cudaFree(a); 
    cudaFree(b); 
    cudaFree(c); 

    system("pause"); 

    return 0; 

}

Source

2015-10-06 Muresan Mircea Paul

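If you add [error checking](http://stackoverflow.com/a/14038590/1231073) to your code, you will immediately find that you are copying extra memory into 'dev_c', and that inside the kernel your '__shared__' memory is accessed out of bounds in the 'for' loop. – sgarizvi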

You are right about the __shared__ memory, this (for (int i = 0; i

First, please add CUDA error checking to your code, as described in this legendary post.
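As a rough illustration of what such error checking can look like, here is a minimal sketch. The macro name `CUDA_CHECK` and the kernel `set_value` are hypothetical names chosen for this example and are not taken from the linked post; only `cudaGetErrorString`, `cudaGetLastError`, and `cudaDeviceSynchronize` are actual CUDA runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CUDA_CHECK is a hypothetical helper name, not part of the CUDA API.
// It aborts with a readable message when a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical trivial kernel, used only to show how a launch is checked.
__global__ void set_value(int *out)
{
    *out = 42;
}

int main()
{
    int *dev_c = NULL;
    int c = 0;

    CUDA_CHECK(cudaMalloc((void **)&dev_c, sizeof(int)));
    CUDA_CHECK(cudaMemcpy(dev_c, &c, sizeof(int), cudaMemcpyHostToDevice));

    set_value<<<1, 1>>>(dev_c);
    CUDA_CHECK(cudaGetLastError());       // reports launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));
    printf("c = %d\n", c);

    CUDA_CHECK(cudaFree(dev_c));
    return 0;
}
```

Wrapping every runtime call this way means a failing `cudaMemcpy` or a faulting kernel stops the program with a message instead of silently leaving the result at its initial value.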

Just before the kernel launch, you are copying extra memory in the following line:

cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice);

It should be:

cudaMemcpy(dev_c, c, sizeof(int), cudaMemcpyHostToDevice);

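Another error in the code is that, inside the kernel, the __shared__ memory variable temp is accessed out of bounds in the for loop. The loop iterates up to N, whereas the shared memory array has only THREADS_PER_BLOCK elements. Simply replace N with THREADS_PER_BLOCK in the loop.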
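Putting the two fixes together, the program might look like the sketch below. It assumes, as in the question, that N is an exact multiple of THREADS_PER_BLOCK. Two details are my own adjustments rather than part of the answer: the result is kept in a plain `int c` on the host instead of a `malloc`'d pointer, and the cleanup calls `cudaFree` on the device pointers (`dev_a`, `dev_b`, `dev_c`) rather than on the host pointers. Error checking is omitted here for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N (2048 * 8)
#define THREADS_PER_BLOCK 512

__global__ void dot(const int *a, const int *b, int *c)
{
    // One partial product per thread in this block.
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];

    // Make sure every thread has written its product before summing.
    __syncthreads();

    if (threadIdx.x == 0)
    {
        int sum = 0;
        // Fix 2: loop over the block's elements only, not over N.
        for (int i = 0; i < THREADS_PER_BLOCK; i++)
            sum += temp[i];
        // Combine the per-block sums into the single result.
        atomicAdd(c, sum);
    }
}

int main()
{
    const size_t size = N * sizeof(int);
    int *a = (int *)malloc(size);
    int *b = (int *)malloc(size);
    int c = 0;
    int sumTest = 0;  // CPU reference value

    for (int i = 0; i < N; i++)
    {
        a[i] = rand() % 10;
        b[i] = rand() % 10;
        sumTest += a[i] * b[i];
    }

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, sizeof(int));

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    // Fix 1: copy exactly one int into dev_c.
    cudaMemcpy(dev_c, &c, sizeof(int), cudaMemcpyHostToDevice);

    dot<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("GPU: %d  CPU: %d\n", c, sumTest);

    // Assumption: free the device allocations, not the host pointers.
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    free(a);
    free(b);
    return 0;
}
```

With both fixes in place, the GPU and CPU values printed at the end should agree.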

Source

2015-10-06 11:09:38 sgarizvi

Yes, I see, thank you! –
