2014-12-07

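CUDA memory management / pointer-to-class problem

I'm struggling with a memory management problem. I keep getting "unspecified launch failure" when copying the results back to the host.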

My code is simple: in each thread it generates two primes and multiplies them. I have a class that provides random numbers:

#include <curand_kernel.h>
#include <climits>
#include <ctime>
#include <iostream>

class CuRandCuRandomNumberProvider
{
public:
    CuRandCuRandomNumberProvider(dim3 numBlocks, dim3 threadsPerBlock);
    CuRandCuRandomNumberProvider(dim3 numBlocks, dim3 threadsPerBlock, unsigned int seed);
    __device__ unsigned int GetRandomNumber();
    ~CuRandCuRandomNumberProvider();
protected:
    curandState * states;   // device array: one generator state per thread
    __device__ bool IsPrime(unsigned int number);
};

CuRandCuRandomNumberProvider::CuRandCuRandomNumberProvider(dim3 numBlocks, dim3 threadsPerBlock) 
{ 
    int numberOfThreads = threadsPerBlock.x * threadsPerBlock.y * numBlocks.x * numBlocks.y; 
    std::cout << numberOfThreads << std::endl; 
    cudaMalloc (&this->states, numberOfThreads*sizeof(curandState)); 
    setup_kernel <<< numBlocks, threadsPerBlock >>> (this->states, time(NULL)); 
} 

__device__ unsigned int CuRandCuRandomNumberProvider::GetRandomNumber() 
{ 
    int x = threadIdx.x + blockIdx.x * blockDim.x; 
    int y = threadIdx.y + blockIdx.y * blockDim.y; 
    int offset = x + y * blockDim.x * gridDim.x; 
    register float r = curand_uniform(&this->states[offset]); 
    return 0 + ((double)UINT_MAX) * r; 
} 
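To make the scaling at the end of GetRandomNumber concrete, here is a host-only C++ sketch of the same arithmetic (no GPU needed). It assumes, as the cuRAND documentation states for curand_uniform, that r lies in (0.0, 1.0]; ScaleUniformToUInt is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper mirroring the last two lines of GetRandomNumber():
// r comes from curand_uniform(), whose range is (0.0, 1.0].
unsigned int ScaleUniformToUInt(float r)
{
    assert(r > 0.0f && r <= 1.0f);  // guard the assumed input range
    // Widen to double, scale by UINT_MAX, then truncate toward zero.
    return static_cast<unsigned int>(static_cast<double>(UINT_MAX) * r);
}
```

If the goal is simply a full 32-bit random value, the cuRAND device API also offers curand(&state), which returns an unsigned int directly; whether that fits the rest of this code is an assumption.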

setup_kernel is stored in a header file and looks like this:

__global__ void setup_kernel (curandState * state, unsigned long seed) 
{ 
    int x = threadIdx.x + blockIdx.x * blockDim.x; 
    int y = threadIdx.y + blockIdx.y * blockDim.y; 
    int offset = x + y * blockDim.x * gridDim.x; 
    curand_init (seed, offset, 0, &state[offset]); 
} 
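The constructor sizes the states array as threadsPerBlock.x * threadsPerBlock.y * numBlocks.x * numBlocks.y, and setup_kernel and GetRandomNumber index it with offset = x + y * blockDim.x * gridDim.x. The host-only sketch below replays that arithmetic for every thread of a small 2D grid to check that each offset is in range and used exactly once. Dim3 is a hypothetical stand-in for CUDA's dim3, and the sketch assumes z is 1 for both grid and block, as the original formulas do.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side stand-in for CUDA's dim3 (z assumed to be 1).
struct Dim3 { unsigned x, y, z; };

// Same count as the provider's constructor.
int NumberOfThreads(Dim3 numBlocks, Dim3 threadsPerBlock)
{
    return threadsPerBlock.x * threadsPerBlock.y * numBlocks.x * numBlocks.y;
}

// Same linear offset as setup_kernel and GetRandomNumber.
int Offset(Dim3 thread, Dim3 block, Dim3 blockDim, Dim3 gridDim)
{
    assert(thread.x < blockDim.x && thread.y < blockDim.y);
    int x = thread.x + block.x * blockDim.x;
    int y = thread.y + block.y * blockDim.y;
    return x + y * blockDim.x * gridDim.x;
}

// Walks every thread of the grid; true if each offset in
// [0, NumberOfThreads) is hit exactly once (one curandState per thread).
bool OffsetsCoverStatesExactlyOnce(Dim3 numBlocks, Dim3 threadsPerBlock)
{
    const int n = NumberOfThreads(numBlocks, threadsPerBlock);
    std::vector<int> hits(n, 0);
    for (unsigned by = 0; by < numBlocks.y; ++by)
        for (unsigned bx = 0; bx < numBlocks.x; ++bx)
            for (unsigned ty = 0; ty < threadsPerBlock.y; ++ty)
                for (unsigned tx = 0; tx < threadsPerBlock.x; ++tx) {
                    int off = Offset({tx, ty, 0}, {bx, by, 0},
                                     threadsPerBlock, numBlocks);
                    if (off < 0 || off >= n) return false;
                    ++hits[off];
                }
    for (int h : hits)
        if (h != 1) return false;
    return true;
}
```

For example, with 4x3 blocks of 8x2 threads there are 192 states, and the last thread (thread (7,1) in block (3,2)) maps to offset 191.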

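My main kernel is very simple and looks like this:

__global__ void InitKernel(uint3 * ptr, CuRandCuRandomNumberProvider * provider) 
{ 
    int x = threadIdx.x + blockIdx.x * blockDim.x; 
    int y = threadIdx.y + blockIdx.y * blockDim.y; 
    int offset = x + y * blockDim.x * gridDim.x; 

    ptr[offset].x = provider->GetRandomNumber(); 
    ptr[offset].y = provider->GetRandomNumber(); 
    ptr[offset].z = ptr[offset].x * ptr[offset].y; 
}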

The host code in which the final cudaMemcpy causes the problem is:

uint3 * pqnD; 

uint3 * pqnH = (uint3*)malloc(sizeof(uint3) * numberOfThreads); 
memset(pqnH,0,sizeof(uint3) * numberOfThreads); 

HANDLE_ERROR(cudaMalloc((void**)&pqnD, sizeof(uint3) * numberOfThreads)); 

CuRandCuRandomNumberProvider * provider = new CuRandCuRandomNumberProvider(numBlocks, threadsPerBlock); 

InitKernel<<<numBlocks, threadsPerBlock>>>(pqnD, provider); 

HANDLE_ERROR(cudaMemcpy(pqnH, pqnD, sizeof(uint3) * numberOfThreads, cudaMemcpyDeviceToHost)); // this line causes error 

HANDLE_ERROR(cudaFree(pqnD)); 

If I do everything explicitly, like this:

uint3 * pqnD; 

uint3 * pqnH = (uint3*)malloc(sizeof(uint3) * numberOfThreads); 

memset(pqnH,0,sizeof(uint3) * numberOfThreads); 

HANDLE_ERROR(cudaMalloc((void**)&pqnD, sizeof(uint3) * numberOfThreads)); 

curandState * states; 

cudaMalloc (&states, numberOfThreads*sizeof(curandState)); 

setup_kernel <<< numBlocks, threadsPerBlock >>> (states, time(NULL)); 

CuRandCuRandomNumberProvider * provider = new CuRandCuRandomNumberProvider(numBlocks, threadsPerBlock, states); 


InitKernel2<<<numBlocks, threadsPerBlock>>>(pqnD, states); 

HANDLE_ERROR(cudaMemcpy(pqnH, pqnD, sizeof(uint3) * numberOfThreads, cudaMemcpyDeviceToHost)); 

HANDLE_ERROR(cudaFree(pqnD)); 

where setup_kernel is exactly the same and InitKernel2 looks like this:

__global__ void InitKernel2(uint3 * ptr, curandState * states) 
{ 
    int x = threadIdx.x + blockIdx.x * blockDim.x; 
    int y = threadIdx.y + blockIdx.y * blockDim.y; 
    int offset = x + y * blockDim.x * gridDim.x; 

    ptr[offset].x = GetRandomNumber(states); 
    ptr[offset].y = GetRandomNumber(states); 
    ptr[offset].z =  ptr[offset].x *  ptr[offset].y; 
} 

and GetRandomNumber is:

__device__ unsigned int GetRandomNumber(curandState * states) 
{ 
    int x = threadIdx.x + blockIdx.x * blockDim.x; 
    int y = threadIdx.y + blockIdx.y * blockDim.y; 
    int offset = x + y * blockDim.x * gridDim.x; 
    register float r = curand_uniform(&states[offset]); 
    return 0 + ((double)UINT_MAX) * r; 

} 

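everything works like a charm. Does anyone have a clue what I'm doing wrong? I've been struggling with this for hours. I think it's probably something to do with memory management or pointer passing, but I can't figure out what.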

Please help :)!
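One more thing worth checking in InitKernel2 and InitKernel, independent of the launch failure: x and y are 32-bit unsigned int fields of uint3, so ptr[offset].z = ptr[offset].x * ptr[offset].y keeps only the low 32 bits of the product. Since the stated goal is multiplying two large random numbers, the full product generally needs 64 bits. The host-only sketch below shows the difference; Product32 and Product64 are hypothetical names, and keeping z in a 64-bit field would mean replacing uint3 with a custom struct, which is an assumption about what the code needs.

```cpp
#include <cassert>
#include <cstdint>

// What ptr[offset].z receives today: unsigned 32-bit multiplication
// is reduced modulo 2^32.
std::uint32_t Product32(std::uint32_t a, std::uint32_t b)
{
    return a * b;
}

// Widening one operand first makes the multiplication happen in 64 bits,
// which is enough for any product of two 32-bit values.
std::uint64_t Product64(std::uint32_t a, std::uint32_t b)
{
    return static_cast<std::uint64_t>(a) * b;
}
```

For instance, 100000 * 100000 is 10000000000, but the 32-bit version yields 1410065408 (the true product minus 2 * 2^32).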


You should provide an MCVE (Minimal, Complete, and Verifiable Example) for a question like this. – 2014-12-07 23:18:49

Answer


This is illegal:

CuRandCuRandomNumberProvider * provider = new CuRandCuRandomNumberProvider(numBlocks, threadsPerBlock); 

InitKernel<<<numBlocks, threadsPerBlock>>>(pqnD, provider); 

provider is a pointer you allocated on the host (with new). Passing that pointer to the device and dereferencing it in device code:

ptr[offset].x = provider->GetRandomNumber(); 

(which eventually leads to:)

register float r = curand_uniform(&this->states[offset]); 

is illegal.

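Since you seem to want to set up the object (of class CuRandCuRandomNumberProvider) on the host and pass it to the device, one possible fix is to pass the object by value instead of by pointer. This requires a few changes. In main: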

CuRandCuRandomNumberProvider provider(numBlocks, threadsPerBlock); 

In InitKernel:

__global__ void InitKernel(uint3 * ptr, CuRandCuRandomNumberProvider provider) // change 
{ 
    int x = threadIdx.x + blockIdx.x * blockDim.x; 
    int y = threadIdx.y + blockIdx.y * blockDim.y; 
    int offset = x + y * blockDim.x * gridDim.x; 

    ptr[offset].x = provider.GetRandomNumber(); // change 
    ptr[offset].y = provider.GetRandomNumber(); // change 
    ptr[offset].z = ptr[offset].x * ptr[offset].y; 
} 

In CuRandCuRandomNumberProvider::GetRandomNumber():

__device__ unsigned int CuRandCuRandomNumberProvider::GetRandomNumber() 
{ 
    int x = threadIdx.x + blockIdx.x * blockDim.x; 
    int y = threadIdx.y + blockIdx.y * blockDim.y; 
    int offset = x + y * blockDim.x * gridDim.x; 
    register float r = curand_uniform(&(states[offset])); // change 
    return 0 + ((double)UINT_MAX) * r; 
} 

(I also removed the destructor prototype, since it was getting in your way.)
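The answer does not say exactly how the destructor gets in the way. One plausible reading, assumed here: with pass-by-value, C++ creates a copy of the provider for the call and destroys that copy afterwards, so a destructor that released states (for example with cudaFree) would also run for the copy, while the original object still refers to the same array. The host-only sketch below just counts destructor calls; ProviderWithDtor, TakesByValue and g_destructorCalls are hypothetical names.

```cpp
#include <cassert>

int g_destructorCalls = 0;  // hypothetical counter, for the demo only

struct ProviderWithDtor
{
    int* states;  // stands in for the curandState* member
    explicit ProviderWithDtor(int* s) : states(s) {}
    // A real destructor might call cudaFree(states) here; this one only counts.
    ~ProviderWithDtor() { ++g_destructorCalls; }
};

// Stands in for a kernel that takes the provider by value.
void TakesByValue(ProviderWithDtor provider) { (void)provider; }

int DestructorCallsForOneByValueCall()
{
    g_destructorCalls = 0;
    int storage[4] = {0, 0, 0, 0};
    {
        ProviderWithDtor original(storage);
        TakesByValue(original);  // the copy is destroyed by the end of this statement
    }                            // the original is destroyed here
    return g_destructorCalls;
}
```

Two destructor calls for one logical object means any cleanup in the destructor would run twice on the same pointer; keeping ownership of states outside the class is one way around that, under the assumption above.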


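It works, but doesn't passing by value force a copy of this value into every thread, and force it to be deleted after execution? Since the CuRandCuRandomNumberProvider object contains a curandState array as long as the number of threads, things start to get somewhat time- and memory-consuming once there are millions of threads :) I really need only one CuRandCuRandomNumberProvider instance. Should I just cudaMemcpy it to the device? Or use constant memory? – pawels1991 2014-12-08 06:39:32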


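'sizeof(CuRandCuRandomNumberProvider)' is 8 bytes on my 64-bit machine, i.e. exactly the size of the 'states' *pointer*. The class/object does not contain a 'curandState' array; it contains a *pointer* to that array. And that pointer (like any other kernel argument) is copied *once* and stored in '__constant__' memory (on cc2.x and newer devices), so I don't see any opportunity to improve efficiency here. I'm not sure what your concern is. Nothing is "copied to every thread". Each thread has to retrieve the pointer regardless. – 2014-12-08 13:53:28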
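The size argument in the comment above can be checked with ordinary C++ on the host. In the sketch below, BigState and ProviderLike are hypothetical stand-ins for curandState and CuRandCuRandomNumberProvider that share the relevant layout idea (a single pointer member); the 48-byte state size is arbitrary.

```cpp
#include <cassert>

// Hypothetical stand-in for curandState; the 48-byte size is arbitrary.
struct BigState { unsigned char bytes[48]; };

// Mirrors the provider's data layout: a single pointer member.
class ProviderLike
{
public:
    explicit ProviderLike(BigState* s) : states(s) {}
    BigState* Get() const { return states; }
private:
    BigState* states;  // points at the array; the array is not stored in the object
};

// A by-value parameter gets a member-wise copy, i.e. a copy of the pointer
// only, so the callee sees the same array as the caller.
BigState* PointerSeenByCallee(ProviderLike provider) { return provider.Get(); }
```

However many states the array holds, the object stays pointer-sized, which is why passing it by value to each launch costs almost nothing.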


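Anyway, you're welcome to solve this any way you like. You asked "Does anyone have a clue what I'm doing wrong?" What you're doing wrong is allocating a pointer on the host (using 'new') and dereferencing that pointer on the device. I'm sure there are many ways to fix that; I just suggested one. I don't think there's any significant problem with what I suggested, but if you think there is, feel free to use another approach. – 2014-12-08 13:57:44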