2016-12-28 53 views
0

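OpenACC: creating data at runtime inside a kernel

I have a task that I want to accelerate with OpenACC. It needs to allocate memory dynamically inside the kernel computation. I built a simpler demo of the problem, shown below.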

#include <iostream> 

using namespace std; 

#pragma acc routine seq 
int *routine(int init) { 
    int *ptr; 
    #pragma acc data create(ptr[:10]) 
    for (int i = 0; i < 10; ++i) { 
     ptr[i] = init + i; 
    } 
    return ptr; 
} 

void print_array(int *arr) { 
    for (int i = 0; i < 10; ++i) { 
     cout << arr[i] << " "; 
    } 
    cout << endl; 
} 

int main(void) { 
    int *arrs[5]; 

#pragma acc kernels 
    for (int i = 0; i < 5; ++i) { 
     arrs[i] = routine(i); 
    } 

    for (int i = 0; i < 5; ++i) { 
     print_array(arrs[i]); 
    } 
    return 0; 
} 

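In this demo, I am trying to call a routine at runtime from inside a kernels construct. The routine is supposed to create some data on the GPU and write some values into it.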

The code compiles, but it fails at runtime with the error shown below.

[email protected]:create_and_copyout$ pgc++ -o test main.cc -acc -Minfo=accel 
routine(int): 
     6, Generating acc routine seq 
main: 
    23, Generating implicit copyout(arrs[:]) 
    26, Accelerator restriction: size of the GPU copy of arrs is unknown 
     Loop is parallelizable 
     Generating implicit copy(arrs[:][:]) 
     Accelerator kernel generated 
     Generating Tesla code 
     26, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */ 
[email protected]:create_and_copyout$ ./test 
call to cuStreamSynchronize returned error 715: Illegal instruction 

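What should I do to accomplish this, that is, to allocate memory dynamically while a kernels construct is executing? I would really appreciate any help.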

Answers

0

This is untested and probably very slow, but it may do what you need.

#include <cstdlib>   // malloc
#include <openacc.h> // acc_async_test, acc_deviceptr

int main() { 
    const int num = 20; 
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}; 
    int* sizes = (int *)malloc(num * sizeof(int)); 
    int *ptrs[num]; 
    int* temp, *temp2; 
    int sum; 
    int* finished = (int *)malloc(num * sizeof(int)); 
    for (int x = 0; x < num; ++x){ 
     finished[x] = 0; 
    } 
    #pragma acc kernels copyin(a[0:10]) copyout(ptrs[:num][:1]) async(num*2+1) 
    { 
     #pragma acc loop private(temp) 
     for (int i = 0; i < num; ++i){ 
      #pragma acc loop seq async(i) 
      for (int j = 0; j < 1; ++j){ 
       temp = ptrs[i];   // remember the pointer currently stored in slot i
       sizes[i] = ... 
      } 
      while (ptrs[i] == temp); // spin until the host stores a new device pointer in slot i
      ptrs[i] = routine(a, sizes[i]); 
     } 
    } 

    while (true){ 
     sum = 0; 
     for (int x = 0; x < num; ++x){ 
      sum += finished[x]; 
     } 
     if (sum == num){ 
      break; 
     } 
     for (int x = 0; x < num; ++x){ 
      if (acc_async_test(x) != 0 && finished[x] == 0){ 
       finished[x] = 1; 
       #pragma acc update host(sizes[x:1]) 
       temp = (int *)malloc(sizes[x] * sizeof(int)); 
       #pragma acc enter data copyin(temp[0:sizes[x]]) 
       temp2 = (int *)acc_deviceptr(temp);   // device address of the new buffer
       ptrs[x] = temp2; 
       #pragma acc update device(ptrs[x:1][0:1]) 
      } 
     } 
    } 
} 
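A simpler alternative, when the size of each array is known before the kernel starts (10 elements per array in the question's demo), is to avoid device-side allocation altogether. In the question's code `ptr` is never assigned, so `routine` writes through an uninitialized pointer. The sketch below allocates one flat buffer on the host and lets each iteration fill its own slice. The names `fill_row` and `build_rows` are made up for illustration, and this has not been run on a GPU. It uses only directives, so a compiler without OpenACC support ignores them and runs the same code on the host.

```cpp
#include <vector>

// Fills a caller-provided row instead of allocating inside the device routine.
// (Hypothetical helper; replaces the question's routine(int).)
#pragma acc routine seq
void fill_row(int *row, int len, int init) {
    for (int i = 0; i < len; ++i) {
        row[i] = init + i;
    }
}

// Builds num_rows rows of len elements each in one flat buffer.
// Row r occupies elements [r * len, (r + 1) * len).
std::vector<int> build_rows(int num_rows, int len) {
    std::vector<int> flat(num_rows * len);
    int *data = flat.data();
    int total = num_rows * len;

    // One allocation and one copy back to the host for all rows.
    #pragma acc parallel loop copyout(data[0:total])
    for (int r = 0; r < num_rows; ++r) {
        fill_row(data + r * len, len, r);
    }
    return flat;
}
```

Calling `build_rows(5, 10)` and printing each group of 10 values gives the five rows that the question's `print_array` loop was meant to show.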
+0

Thanks, Kyle. However, performance is critical. I am thinking of discarding some results and keeping only the best ones.
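If performance matters and the sizes are only known after some device-side work, one possible pattern is two passes: compute every size on the device, sum them into offsets on the host, allocate a single buffer, then fill it in a second kernel. This replaces the per-element host/device handshake with two kernel launches. The sketch below is an assumption-laden illustration and has not been measured: `row_size` is a made-up size rule (row r gets r + 1 elements), and `Ragged`, `fill_row`, and `build_ragged` are hypothetical names. It assumes `n > 0`.

```cpp
#include <vector>

// Hypothetical size rule for illustration: row r needs r + 1 elements.
#pragma acc routine seq
int row_size(int r) { return r + 1; }

// Writes init, init + 1, ... into a caller-provided row of len elements.
#pragma acc routine seq
void fill_row(int *row, int len, int init) {
    for (int i = 0; i < len; ++i) {
        row[i] = init + i;
    }
}

struct Ragged {
    std::vector<int> offsets; // offsets[r] = start of row r; offsets[n] = total size
    std::vector<int> values;  // all rows stored back to back
};

Ragged build_ragged(int n) {
    Ragged out;
    std::vector<int> sizes(n);
    int *sz = sizes.data();

    // Pass 1: compute each row's size on the device.
    #pragma acc parallel loop copyout(sz[0:n])
    for (int r = 0; r < n; ++r) {
        sz[r] = row_size(r);
    }

    // Host: exclusive prefix sum turns sizes into row offsets.
    out.offsets.assign(n + 1, 0);
    for (int r = 0; r < n; ++r) {
        out.offsets[r + 1] = out.offsets[r] + sizes[r];
    }

    // One allocation for all rows, sized from the prefix sum.
    int total = out.offsets[n];
    out.values.resize(total);
    int *off = out.offsets.data();
    int *val = out.values.data();

    // Pass 2: each iteration fills its own slice of the shared buffer.
    #pragma acc parallel loop copyin(off[0:n + 1]) copyout(val[0:total])
    for (int r = 0; r < n; ++r) {
        fill_row(val + off[r], off[r + 1] - off[r], r);
    }
    return out;
}
```

Row r can then be read on the host as `values[offsets[r]]` through `values[offsets[r + 1] - 1]`. Discarding all but the best results would fit between the two passes, for example by setting the size of an unwanted row to zero before the prefix sum.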
