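I want to use dynamic parallelism in CUDA. I am in a situation where the parent kernel has a variable that needs to be passed to the child kernel for further computation. I have gone through the resource here on passing variables from a parent kernel to a child kernel in CUDA dynamic parallelism.
It also mentions that local variables cannot be passed to a child kernel, and it describes how variables can be passed. I tried passing the variable like this: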
#include <stdio.h>
#include <cuda.h>

__global__ void square(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (N == 10)
    {
        a[idx] = a[idx] * a[idx];
    }
}

// Kernel that executes on the CUDA device
__global__ void first(float *arr, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int n = N; // this value of n can be changed locally and needs to be passed to the child
    printf("%d\n", n);
    cudaMalloc((void **) &n, sizeof(int));
    square <<< 1, N >>> (arr, n);
}

// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d; // Pointer to host & device arrays
    const int N = 10; // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);       // Allocate array on host
    cudaMalloc((void **) &a_d, size);  // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // Do calculation on device:
    first <<< 1, 1 >>> (a_d, N);
    //cudaThreadSynchronize();
    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    // Cleanup
    free(a_h); cudaFree(a_d);
}
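But the value is not passed from the parent kernel to the child kernel. How can I pass the value of a local variable? Is there any way to do this?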
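Yes, it works without cudaMalloc. However, the documentation says that local variables cannot be passed to a child kernel, yet in the example above passing a local variable works fine. How is that possible? – Malacu 2014-09-09 02:42:33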
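A minimal sketch of the working version, assuming a GPU of compute capability 3.5 or higher and a build with relocatable device code (the nvcc flags in the first comment are an example; adjust -arch to your GPU). The only substantive change from the question is that the cudaMalloc call in first is gone: cudaMalloc((void **) &n, ...) writes a device pointer into the storage of the int n, which most likely overwrites its value, so the child no longer sees N == 10 and skips the squaring. The host helper run_square is a hypothetical name used only to keep the example compact.

```cuda
// Compile with relocatable device code, for example:
//   nvcc -arch=sm_50 -rdc=true square.cu -o square -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: squares each element, but only when it was given N == 10.
__global__ void square(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (N == 10 && idx < N)
        a[idx] = a[idx] * a[idx];
}

// Parent kernel: n is an ordinary local variable. Its current value is
// copied into the child's parameter list at launch, so no cudaMalloc is needed.
__global__ void first(float *arr, int N)
{
    int n = N;                  // may be modified locally before the launch
    printf("parent: n = %d\n", n);
    square<<<1, N>>>(arr, n);   // n is passed by value
}

// Hypothetical host helper: fills h[i] = i, runs the parent kernel and
// copies the result back into h. Returns 0 on success.
int run_square(float *h, int N)
{
    float *d = nullptr;
    for (int i = 0; i < N; i++) h[i] = (float)i;
    if (cudaMalloc((void **)&d, N * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
    first<<<1, 1>>>(d, N);
    // A parent grid is not complete until its child grids have completed,
    // so waiting for the parent also waits for square.
    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    const int N = 10;
    float a_h[N];
    if (run_square(a_h, N) != 0) { fprintf(stderr, "CUDA error\n"); return 1; }
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    return 0;
}
```

Each printed value should then be the square of its index, for example 3 9.000000.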
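You can pass a local variable to a child kernel if you pass it *by value as a kernel argument*. The documentation states that *pointers to local variables* must not be passed. The code you wrote passes 'n' by value in 'first'. – 2014-09-09 13:35:47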
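The distinction drawn in this reply (values are fine, pointers to local memory are not) can be sketched as follows, under the same build assumptions (compute capability 3.5 or higher, -rdc=true). The kernels child and parent and the helper run_parent are hypothetical names. Each parent thread passes its local n by value and hands the child a pointer into a buffer in global memory, which child grids are allowed to access. The commented-out launch with &n shows the form the CUDA Programming Guide forbids; the same restriction applies to pointers to shared memory.

```cuda
// Compile with relocatable device code, for example:
//   nvcc -arch=sm_50 -rdc=true args.cu -o args -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Child: receives a scalar by value and a pointer into global memory.
__global__ void child(int n, int *out)
{
    out[threadIdx.x] = n * 10 + (int)threadIdx.x;
}

// Parent: each thread computes a local n and launches its own child grid.
__global__ void parent(int *global_buf, int per_child)
{
    int n = threadIdx.x + 1;  // local variable
    // OK: n is copied into the child's parameters (pass by value).
    // OK: global_buf points to global memory, visible to the child grid.
    child<<<1, per_child>>>(n, global_buf + threadIdx.x * per_child);
    // NOT OK: child<<<1, 1>>>(n, &n);
    //   &n points to the parent thread's local memory, which must not be
    //   passed to a child kernel.
}

// Hypothetical host helper: allocates the global buffer, launches the
// parent grid and copies all child results back into h.
int run_parent(int *h, int parents, int per_child)
{
    int *d = nullptr;
    size_t bytes = (size_t)parents * per_child * sizeof(int);
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) return 1;
    parent<<<1, parents>>>(d, per_child);
    cudaError_t err = cudaDeviceSynchronize();  // also waits for child grids
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    int h[6];
    if (run_parent(h, 2, 3) != 0) { fprintf(stderr, "CUDA error\n"); return 1; }
    for (int i = 0; i < 6; i++) printf("%d\n", h[i]);
    return 0;
}
```

With two parent threads and three child threads each, the buffer should hold 10, 11, 12, 20, 21, 22. Reading a child's output back inside the parent would need additional synchronization, so this sketch only reads it on the host after the whole launch has finished.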