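CUDA stride function does not work

The following code does not work. I expect every y[i] to be 3 after the kernel function add() is called. But if N >= (1 << 24) - 255, all y[i] are still 2 (as if the kernel function add() had never run).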
#include <iostream>

// Grid-stride loop: each thread handles elements index, index + stride, ...
__global__ void add(int n, int *x, int *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) y[i] = x[i] + y[i];
}

int main() {
    int *x, *y, N = (1 << 24) - 255; // fails with 255, works with 256
    cudaMallocManaged(&x, N * sizeof(int));
    cudaMallocManaged(&y, N * sizeof(int));
    for (int i = 0; i < N; ++i) {x[i] = 1; y[i] = 2;}
    int sz = 256;                    // threads per block
    dim3 blockDim(sz,1,1);
    dim3 gridDim((N+sz-1)/sz,1,1);   // one thread per element, rounded up
    add<<<gridDim, blockDim>>>(N, x, y);
    cudaDeviceSynchronize();
    for (int i = 0; i < N; ++i) if (y[i]!=3) std::cout << "error" << std::endl;
    cudaFree(x);
    cudaFree(y);
    return 0;
}
The GPU is a GTX 1080 Ti and has the following limits:
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
The machine is x86_64 Linux, running Ubuntu 16.04. What am I doing wrong here? Please help.
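
For reference, the two values of N in the question sit on either side of a block count of 65535. The host-only sketch below redoes the (N+sz-1)/sz arithmetic from main(). The names blocksFor and fitsLegacyGridLimit are hypothetical helpers, not CUDA API, and the 65535 cap is an assumption: it is the maximum grid x-dimension for compute capability 2.x targets, which may be what the compiler targets when no -arch flag is passed.

```cpp
#include <cassert>

// Number of blocks requested by the question's launch: ceil(n / blockSize).
// blocksFor is a hypothetical helper name, not part of the CUDA API.
inline int blocksFor(int n, int blockSize) {
    assert(n > 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// ASSUMPTION: 65535 is the grid x-dimension limit in effect, as it is for
// compute capability 2.x targets. The device itself reports 2147483647.
inline bool fitsLegacyGridLimit(int blocks) {
    return blocks <= 65535;
}

// blocksFor((1 << 24) - 256, 256) == 65535  -> within the assumed limit
// blocksFor((1 << 24) - 255, 256) == 65536  -> one block over it
```

If that assumption holds, the launch with N = (1 << 24) - 255 would be rejected before any thread runs, which would match y[i] staying at 2.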
[Proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) would help focus attention on the problem. –
Thanks! I added gpuErrchk(cudaPeekAtLastError()); after the call to the kernel function add. Then, when I compile it without -arch=sm_60, it returns "GPUassert: invalid argument test.cu 42". – eii0000
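
An "invalid argument" error on launch is consistent with a grid that is too large for the architecture the binary was built for. Because add() already uses a grid-stride loop, one possible workaround (besides compiling with -arch=sm_60) is to cap the number of blocks and let the loop cover the remaining elements. The sketch below is host-only C++. cappedBlocks and visitCounts are hypothetical names, the default cap of 65535 is an assumption about the limit in effect, and visitCounts only simulates the kernel's loop on the CPU to show that every element is still visited exactly once.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: blocks needed for n elements, capped at maxBlocks.
// ASSUMPTION: 65535 is the grid x-dimension limit of the compile target.
inline int cappedBlocks(int n, int blockSize, int maxBlocks = 65535) {
    assert(n > 0 && blockSize > 0 && maxBlocks > 0);
    return std::min((n + blockSize - 1) / blockSize, maxBlocks);
}

// Host-side stand-in for the kernel's grid-stride loop: returns how many
// times each element would be visited across all threads of the grid.
inline std::vector<int> visitCounts(int n, int blocks, int blockSize) {
    std::vector<int> visits(n, 0);
    const long long stride = static_cast<long long>(blocks) * blockSize;
    for (long long index = 0; index < stride; ++index) {   // one pass per thread
        for (long long i = index; i < n; i += stride) {    // same loop as add()
            ++visits[i];
        }
    }
    return visits;
}
```

With this, the launch in the question would become add<<<cappedBlocks(N, sz), sz>>>(N, x, y); and the kernel itself would not need to change.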