Understanding the CUDA SHFL instruction

I have read the Shuffle Tips and Tricks paper, but I do not know exactly how to apply it to some tricky code I have inherited:
extern __shared__ unsigned int lpSharedMem[];
int tid = threadIdx.x;
lpSharedMem[tid] = startValue;
volatile unsigned int *srt = lpSharedMem;
// ...various stuff
// Tree reduction in shared memory: each step folds values a stride of
// 32/16/8/4/2/1 away into the current slot with min().
srt[tid] = min(srt[tid], srt[tid+32]);
srt[tid] = min(srt[tid], srt[tid+16]);
srt[tid] = min(srt[tid], srt[tid+8]);
srt[tid] = min(srt[tid], srt[tid+4]);
srt[tid] = min(srt[tid], srt[tid+2]);
srt[tid] = min(srt[tid], srt[tid+1]);
__syncthreads();
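For context, here is roughly how I believe the kernel is launched (a sketch only; the 64-thread block and the shared-memory sizing are my guesses from the srt[tid+32] access, not something stated in the code I inherited):

// Hypothetical launch: srt[tid+32] implies at least 64 shared-memory slots,
// so I assume 64 threads per block and dynamic shared memory sized to match.
myKernel<<<numBlocks, 64, 64 * sizeof(unsigned int)>>>(/* arguments elided */);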
Even leaving the CUDA specifics aside, this code is murky to me, but looking at this implementation I see:
__device__ inline int min_warp(int val)
{
    val = min(val, __shfl_xor(val, 16));
    val = min(val, __shfl_xor(val, 8));
    val = min(val, __shfl_xor(val, 4));
    val = min(val, __shfl_xor(val, 2));
    val = min(val, __shfl_xor(val, 1));
    return __shfl(val, 0);
}
This code would presumably be called with:
int minVal = min_warp(startValue);
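To make the question concrete, here is a minimal kernel sketch of how I would wire min_warp in (d_in, d_out, and minKernel are names I made up, and I am assuming one 32-thread warp per block, since min_warp only reduces across the 32 lanes of a warp):

// Sketch: one warp per block; every lane loads one element, min_warp
// combines them, and lane 0 writes out the per-warp minimum.
__global__ void minKernel(const int *d_in, int *d_out)
{
    int startValue = d_in[blockIdx.x * warpSize + threadIdx.x];
    int minVal = min_warp(startValue);  // all lanes receive the warp minimum
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = minVal;
}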
It seems that I could replace my rather awkward volatile shared-memory reduction with the code above. However, I cannot really follow what is going on; could someone explain whether I am right, and what exactly happens inside the min_warp() function?
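For reference, this is my current mental model of the XOR butterfly, written as a host-side emulation (this only reflects how I think __shfl_xor(val, mask) behaves across the 32 lanes of a warp; please correct me if the model is wrong):

// Host-side emulation: in each pass, lane i reads the value held by
// lane (i ^ mask), which is what I understand __shfl_xor(val, mask) to do.
// If this is right, all 32 lanes hold the global minimum after five passes.
#include <algorithm>
void emulate_min_warp(int lanes[32])
{
    for (int mask = 16; mask >= 1; mask /= 2) {
        int next[32];
        for (int i = 0; i < 32; ++i)
            next[i] = std::min(lanes[i], lanes[i ^ mask]);
        std::copy(next, next + 32, lanes);
    }
}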
Take a look at this: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ – Hopobcn