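CUDA address out of bounds

I have been playing with a simple CUDA program that just zeroes out global memory. Below is the device code, along with the host code: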
#include <stdio.h>

__global__ void kernel(float *data, int width) {
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;

    if (x > (width-1)) {
        printf("x = %d\n", x);
        printf("blockDim.x = %d\n", blockDim.x);
        printf("blockIdx.x = %d\n", blockIdx.x);
        printf("threadIdx.x = %d\n", threadIdx.x);
    }

    if (y > (width-1)) {
        printf("y = %d\n", y);
        printf("blockDim.y = %d\n", blockDim.y);
        printf("blockIdx.y = %d\n", blockIdx.y);
        printf("threadIdx.y = %d\n", threadIdx.y);
    }

    data[y * width + x] = 0.0;
}

int main(void) {
    const int MATRIX_SIZE = 256;
    float *data, *dataGPU;
    int sizeOfMem;
    int x = MATRIX_SIZE;
    int y = MATRIX_SIZE;

    cudaDeviceReset();
    cudaDeviceSynchronize();

    sizeOfMem = sizeof(float) * x * y;
    data = (float *)malloc(sizeOfMem);
    cudaMalloc((void **)&dataGPU, sizeOfMem);
    cudaMemcpy(dataGPU, data, sizeOfMem, cudaMemcpyHostToDevice);

    //int threads = 256;
    //int blocks = ((x * y) + threads - 1)/threads;
    dim3 threads(16, 16);
    dim3 blocks(x/16, y/16);

    kernel<<<blocks, threads>>>(dataGPU, MATRIX_SIZE);
    cudaThreadSynchronize();

    cudaMemcpy(data, dataGPU, sizeOfMem, cudaMemcpyDeviceToHost);
    cudaFree(dataGPU);
    free(data);

    return 0;
}
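I keep receiving address out of bounds error messages when I run my code with CUDA-MEMCHECK, but only when the dimension of the matrix I create is 128 or greater. If the dimension is less than 128, the errors are much less frequent (I almost never get one). You may notice that I included print statements in my kernel function. These statements only print when I get an error message, because x and y should never be greater than width-1, which is 255 in this case. That holds if I have done my math correctly, which I believe I have. Below is an error message I received from CUDA-MEMCHECK: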
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 4
========= at 0x00000298 in kernel(float*, int)
========= by thread (3,10,0) in block (15,1,0)
========= Address 0x2300da6bcc is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x472225]
========= Host Frame:./test_reg_memory [0x16c41]
========= Host Frame:./test_reg_memory [0x31453]
========= Host Frame:./test_reg_memory [0x276d]
========= Host Frame:./test_reg_memory [0x24f0]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./test_reg_memory [0x25cd]
=========
y = 2074
blockDim.y = 16
blockIdx.y = 1
threadIdx.y = 10
This output makes no sense to me, because if I do the math,
y = blockDim.y * blockIdx.y + threadIdx.y = 16 * 1 + 10 = 26 (not 2074)
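I have spent some time on the CUDA programming forums, and nothing there seems to help. I read one thread suggesting that I may have corrupted register memory. However, the person who started that thread had the problem on a different GPU. The thread is somewhat unrelated, but I am including the link anyway.
Below I have included the nvcc version.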
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
Also, here is the GPU I am using.
Device 0: "GeForce GT 640"
CUDA Driver Version/Runtime Version 8.0/7.5
CUDA Capability Major/Minor version number: 3.0
Could anyone with CUDA experience point out what I might be doing wrong?
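The code you posted runs fine for me and produces no errors in cuda-memcheck. If you copy and paste the code from the SO question, compile it, and run it, are you really sure the posted code produces cuda-memcheck errors? – talonmies
Did cudaMalloc succeed? –
@RegisPortalez: If cudaMalloc had failed, cuda-memcheck would have reported an error. The posted output contains no such error. – talonmies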