2011-08-17 70 views
1

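I want to block some blocks until a variable is set to a particular value, so I wrote this code to test whether a simple do-while loop would work. Is there a way to block some blocks until a condition is met?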

__device__ int tag = 0;
__global__ void kernel() {
    if (threadIdx.x == 0) {
        volatile int v;
        do {
            v = tag;
        } while (v == 0);
    }
    __syncthreads();
    return;
}

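However, it does not work (no infinite loop occurs, which is strange).

I would like to ask whether there is another way to block some blocks until a condition is met, or whether some change to this code would make it work.

Answers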

3

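There is currently no reliable way to perform inter-block synchronization in CUDA.

There are hacky ways to achieve some form of locking or blocking between blocks, but they exploit undefined behavior in the execution model, which is not guaranteed to behave the same way on all hardware or to keep working in the future. The only reliable way to ensure synchronization or blocking between blocks is to use separate kernel launches. If you cannot make your algorithm work without inter-block synchronization, then either you need a new algorithm or your application is a poor fit for the GPU architecture.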
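
The separate-launch approach described above can be sketched roughly as follows. This is only an illustration, not code from the original answer: the kernel names `produce` and `consume`, the helper `sum_with_two_launches`, and the integer summation task are assumptions chosen to mirror the experiment posted further down in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Phase 1 (hypothetical name): every block writes its own value.
// No block ever waits on another block.
__global__ void produce(int *data)
{
    if (threadIdx.x == 0) data[blockIdx.x] = blockIdx.x + 1;
}

// Phase 2 (hypothetical name): sums what phase 1 wrote.
// It is launched after produce() in the same stream, so it cannot start
// before every block of produce() has finished.
__global__ void consume(const int *data, int n, long long *result)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        long long sum = 0;
        for (int j = 0; j < n; j++) sum += data[j];
        *result = sum;
    }
}

// Host helper (hypothetical): returns 1 + 2 + ... + n, or -1 on a CUDA error.
// Assumes n is positive and within the device's grid size limit.
long long sum_with_two_launches(int n)
{
    int *d_data = NULL;
    long long *d_result = NULL;
    long long h_result = -1;

    if (cudaMalloc((void **)&d_data, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d_result, sizeof(long long)) != cudaSuccess) {
        cudaFree(d_data);
        return -1;
    }

    produce<<<n, 32>>>(d_data);               // first launch
    consume<<<1, 32>>>(d_data, n, d_result);  // second launch = grid-wide barrier

    // cudaMemcpy waits for both kernels and reports any earlier error.
    cudaError_t err = cudaMemcpy(&h_result, d_result, sizeof(h_result),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        h_result = -1;
    }

    cudaFree(d_data);
    cudaFree(d_result);
    return h_result;
}

int main()
{
    printf("sum of first %d is %lld\n", 100, sum_with_two_launches(100));
    return 0;
}
```

The kernel boundary plays the role of the barrier here: launches issued to the same stream run in order, so `consume` only sees `data` after all blocks of `produce` have completed, regardless of how the hardware scheduled them.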

+0

Would using atomic reads/writes not help? –

+0

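No. I was alluding to hacky methods that use atomics, but they all rely on knowing the execution order when a conditional branch is executed. They are also not general, because they rely on every block in the grid being scheduled and active, which is hardware dependent. – talonmies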

+0

Then can you give some insight into why the code I posted starts failing at 5795+? I am not sure it is failing for any of the reasons you mentioned. –

0

Here is a hackish approach I tried, to see whether it would work.

#include <stdio.h>
#include <stdlib.h>   // atoi, malloc, free
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>

__global__ static 
void kernel(int *count, float *data) 
{ 
    count += threadIdx.x; 
    data += gridDim.x * threadIdx.x; 
    int i = blockIdx.x; 
    if (i < gridDim.x - 1) { 
     data[i] = i + 1; 
     atomicAdd(count, 1); 
     return; 
    } 

    while (atomicMin(count, i) != i); 

    float tmp = i + 1; 
    for (int j = 0; j < i; j++) tmp += data[j]; 

    data[i] = tmp; 
} 

int main(int argc, char **args) 
{ 
    int num = 100;
    if (argc >= 2) num = atoi(args[1]); 

    int bytes = num * sizeof(float) * 32; 
    float *d_data; cudaMalloc((void **)&d_data, bytes); 
    float *h_data = (float *)malloc(bytes); 
    for (int i = 0; i < 32 * num; i++) h_data[i] = -1; // Being safe                               

    int h_count[32] = {1}; 
    int *d_count; cudaMalloc((void **)&d_count, 32 * sizeof(int)); 
    cudaMemcpy(d_count, &h_count, 32 * sizeof(int), cudaMemcpyHostToDevice); 
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); 
    kernel<<<num, 32>>>(d_count, d_data); 
    cudaMemcpy(&h_count, d_count, 32 * sizeof(int), cudaMemcpyDeviceToHost); 
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); 

    for (int i = 0; i < 32; i++) {
        // Each thread's results occupy a contiguous run of num floats.
        printf("sum of first %d from thread %d is %d \n", num, i, (int)h_data[i * num + num - 1]);
    }

    cudaFree(d_count);
    cudaFree(d_data);
    free(h_data);
    return 0;
}

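I cannot guarantee that this will always work. However, the breaking point on my card (320M) seems to be num = 5796. Maybe it is some kind of hardware limit that differs from card to card?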

EDIT

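The answer is that the sum n * (n + 1) / 2 exceeds 2^24 = 16777216 in this range (it is already 16793910 at n = 5795), and 2^24 is the limit up to which single precision represents every integer exactly. Beyond that point integer values are no longer guaranteed to be exact, so the float accumulation can start to round. Thanks to talonmies for pointing it out.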

./a.out 5795 
sum of first 5795 from thread 0 is 16793910 
sum of first 5795 from thread 1 is 16793910 
sum of first 5795 from thread 2 is 16793910 
sum of first 5795 from thread 3 is 16793910 
sum of first 5795 from thread 4 is 16793910 
sum of first 5795 from thread 5 is 16793910 
sum of first 5795 from thread 6 is 16793910 
sum of first 5795 from thread 7 is 16793910 
sum of first 5795 from thread 8 is 16793910 
sum of first 5795 from thread 9 is 16793910 
sum of first 5795 from thread 10 is 16793910 
sum of first 5795 from thread 11 is 16793910 
sum of first 5795 from thread 12 is 16793910 
sum of first 5795 from thread 13 is 16793910 
sum of first 5795 from thread 14 is 16793910 
sum of first 5795 from thread 15 is 16793910 
sum of first 5795 from thread 16 is 16793910 
sum of first 5795 from thread 17 is 16793910 
sum of first 5795 from thread 18 is 16793910 
sum of first 5795 from thread 19 is 16793910 
sum of first 5795 from thread 20 is 16793910 
sum of first 5795 from thread 21 is 16793910 
sum of first 5795 from thread 22 is 16793910 
sum of first 5795 from thread 23 is 16793910 
sum of first 5795 from thread 24 is 16793910 
sum of first 5795 from thread 25 is 16793910 
sum of first 5795 from thread 26 is 16793910 
sum of first 5795 from thread 27 is 16793910 
sum of first 5795 from thread 28 is 16793910 
sum of first 5795 from thread 29 is 16793910 
sum of first 5795 from thread 30 is 16793910 
sum of first 5795 from thread 31 is 16793910 
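
The single-precision limit mentioned in the edit above can be checked on the host without involving the GPU at all. The following is a small sketch, not part of the original experiment: the helper names `exact_sum` and `float_sum` are made up for illustration, and `float_sum` assumes the same order of additions as the final loop of the kernel (start from n, then add 1, 2, ..., n - 1).

```cuda
#include <cstdio>

// Exact value of 1 + 2 + ... + n in 64-bit integer arithmetic.
long long exact_sum(int n)
{
    return (long long)n * (n + 1) / 2;
}

// The same sum accumulated in single precision, in the order the kernel
// uses: tmp starts at n, then data[j] = j + 1 is added for j = 0 .. n - 2.
// volatile forces every partial sum to be rounded to a 32-bit float.
float float_sum(int n)
{
    volatile float tmp = (float)n;
    for (int j = 0; j < n - 1; j++) tmp = tmp + (float)(j + 1);
    return tmp;
}

int main()
{
    // 2^24 + 1 is not representable as a float; it rounds back to 2^24.
    volatile float limit = 16777216.0f;
    volatile float next = limit + 1.0f;
    printf("2^24     as float: %.1f\n", limit);
    printf("2^24 + 1 as float: %.1f\n", next);

    // Compare exact and single-precision sums around the reported breaking point.
    for (int n = 5790; n <= 5800; n++) {
        printf("n = %d  exact = %lld  float = %.1f\n",
               n, exact_sum(n), float_sum(n));
    }
    return 0;
}
```

Because the file contains only host code, it builds with nvcc or with any C++ compiler. Switching the accumulator in the kernel to `int` or `double` would be one way to take this rounding effect out of the synchronization experiment.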

-

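I edited my previous code, which used only one block. This version is more representative of real-world threads and blocks (the memory accesses are odd and will be slow as hell, but they let me quickly port my old test code to use multiple threads).

It looks like there are cases where you can synchronize across blocks, but it mostly depends on knowing certain things in advance (in this particular case, I only synchronize n - 1 blocks before the last block performs an insanely useless sum).

This is only a proof of concept; do not take the code seriously.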

+0

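@Kun, if you do try this, remember that it will be slow as hell. I agree with talonmies: you need to use a different algorithm or use two kernels. I only did this experiment for the sake of knowledge. –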

+0

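The 5796 limit you are seeing is a single-precision floating point artifact, not anything hardware related: 16777216 = 2^24 is the limit of the 24-bit mantissa in IEEE 32-bit floating point. I do not see how your example is a valid demonstration of real-world synchronization when you only run 1 thread per block... – talonmies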

+0

Ah, I forgot! I was thinking with a sleepy head. –
