Simple addition example: shared-memory reduction runs slower than the global-memory version

I have implemented two versions of a reduction ("add"). The addition scheme is identical in both; the only difference is that the first listing below uses global memory while the second uses shared memory. As stated in several places, the shared-memory version should be faster, but in my case the global-memory version is faster. Please tell me where I went wrong. Note: my GPU is compute capability 2.1, so shared memory has 32 banks. Since the example uses only 16 integers, my code should be free of bank conflicts. Please let me know whether that reasoning is correct.
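For reference, on cc 2.x each 4-byte word maps to bank (word index) % 32, so a quick sanity check (my illustration, not part of the original post) confirms that 16 consecutive ints land in 16 distinct banks:

    // On cc 2.x: bank(i) = (byte_address / 4) % 32 for 4-byte words.
    // 16 consecutive ints therefore occupy banks 0..15 -- no conflicts.
    for (int i = 0; i < 16; ++i) {
        printf("temp[%d] -> bank %d\n", i, i % 32);
    }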
Global memory version:
#include <stdio.h>

__global__ void reductionGlobal(int* in, int sizeArray, int offset){
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if(tid < sizeArray){
        // Interleaved addressing: each pass, every (offset*2)-th thread
        // adds in the element `offset` positions to its right.
        if(tid % (offset * 2) == 0){
            in[tid] += in[tid + offset];
        }
    }
}
int main(){
    int size = 16; // number of input elements (a power of two)
    int cidata[] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    int* gidata;
    cudaMalloc((void**)&gidata, size * sizeof(int));
    cudaMemcpy(gidata, cidata, size * sizeof(int), cudaMemcpyHostToDevice);
    int offset = 1;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    while(offset < size){
        // Separate kernel launches synchronize across blocks;
        // __syncthreads() only synchronizes within a block.
        reductionGlobal<<<4,4>>>(gidata, size, offset);
        offset *= 2;
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("time is %f ms\n", elapsedTime);
    int* output = (int*)malloc(size * sizeof(int));
    cudaMemcpy(output, gidata, size * sizeof(int), cudaMemcpyDeviceToHost);
    printf("The sum of the array using only global memory is %d\n", output[0]);
    getchar();
    return 0;
}
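Neither listing checks for CUDA errors, so a failed launch would silently produce garbage timings. A minimal checking sketch (my addition; the CUDA_CHECK macro name is hypothetical, the runtime calls are standard):

    #define CUDA_CHECK(call) do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            printf("CUDA error %s at line %d\n", cudaGetErrorString(err), __LINE__); \
            return 1; \
        } \
    } while (0)

    // After each launch: cudaGetLastError catches launch failures, and a
    // synchronize surfaces execution errors (use only while debugging --
    // it would distort the event timings shown above).
    reductionGlobal<<<4,4>>>(gidata, size, offset);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());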
Shared memory version:
#include <stdio.h>

__global__ void computeAddShared(int *in, int *out, int sizeInput){
    extern __shared__ int temp[]; // was float; the data being summed are ints
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int ltid = threadIdx.x;
    temp[ltid] = 0;
    while(tid < sizeInput){
        temp[ltid] += in[tid];
        tid += gridDim.x * blockDim.x; // grid-stride loop handles arrays of any size
    }
    __syncthreads();
    int offset = 1;
    while(offset < blockDim.x){
        if(ltid % (offset * 2) == 0){
            temp[ltid] = temp[ltid] + temp[ltid + offset];
        }
        __syncthreads();
        offset *= 2;
    }
    if(ltid == 0){
        out[blockIdx.x] = temp[0];
    }
}
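One likely cost inside this kernel is the ltid % (offset * 2) == 0 test: it is interleaved addressing, so every warp stays resident while only a shrinking fraction of its threads does work. The classic alternative from NVIDIA's reduction sample is sequential addressing; a sketch of that variant (my addition, assuming blockDim.x is a power of two), not the original poster's code:

    __global__ void computeAddSharedSeq(int *in, int *out, int sizeInput){
        extern __shared__ int temp[];
        int ltid = threadIdx.x;
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        temp[ltid] = 0;
        while(tid < sizeInput){
            temp[ltid] += in[tid];
            tid += gridDim.x * blockDim.x;
        }
        __syncthreads();
        // Sequential addressing: the active threads stay contiguous, so
        // whole warps retire early and shared accesses are conflict-free.
        for(int offset = blockDim.x / 2; offset > 0; offset >>= 1){
            if(ltid < offset){
                temp[ltid] += temp[ltid + offset];
            }
            __syncthreads();
        }
        if(ltid == 0){
            out[blockIdx.x] = temp[0];
        }
    }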
int main(){
    int size = 16; // size of the current input array; shrinks every loop iteration
    int cidata[] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    int* gidata;
    int* godata;
    cudaMalloc((void**)&gidata, size * sizeof(int));
    cudaMemcpy(gidata, cidata, size * sizeof(int), cudaMemcpyHostToDevice);
    int TPB = 4;
    int blocks = 10; // any value != 1, just to enter the loop
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    while(blocks != 1){
        if(size < TPB){
            TPB = size; // size is a power of two
        }
        blocks = (size + TPB - 1) / TPB;
        cudaMalloc((void**)&godata, blocks * sizeof(int));
        // Dynamic shared memory is specified in bytes, not elements.
        computeAddShared<<<blocks, TPB, TPB * sizeof(int)>>>(gidata, godata, size);
        cudaFree(gidata);
        gidata = godata;
        size = blocks;
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("time is %f ms\n", elapsedTime);
    int *output = (int*)malloc(sizeof(int));
    cudaMemcpy(output, gidata, sizeof(int), cudaMemcpyDeviceToHost);
    // gidata and godata point to the same allocation here, so free it once.
    cudaFree(gidata);
    printf("The sum of the array is %d\n", output[0]);
    getchar();
    return 0;
}
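Before comparing the two timings, it helps to confirm that both versions compute the same sum, and to keep in mind that with only 16 integers the measurement is dominated by kernel-launch overhead rather than by memory traffic. A host-side check (my addition):

    int expected = 0;
    for(int i = 0; i < 16; ++i){
        expected += cidata[i]; // CPU reference sum: 136
    }
    if(output[0] != expected){
        printf("Mismatch: GPU %d vs CPU %d\n", output[0], expected);
    }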
I have also been fighting for the fastest performance and have played with many approaches beyond plain global memory: page-locked global, texture, shared, constant, and registers... Global memory is my favourite. For dot products I can hit 4 teraflops on a single ASUS GTX260 216 board. You need to design your kernels so that memory accesses get coalesced; coalesced global-memory access is the fastest. – Prafulla
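To illustrate the coalescing point: a warp whose threads read consecutive words is serviced by one (or a few) memory transactions, whereas a strided pattern fans out into many. A toy pair of kernels (my illustration, not from the comment):

    // Coalesced: neighbouring threads read neighbouring words.
    __global__ void copyCoalesced(const int* in, int* out, int n){
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if(i < n){
            out[i] = in[i];
        }
    }

    // Strided reads: neighbouring threads read words `stride` apart,
    // splitting each warp's request into many transactions.
    __global__ void copyStrided(const int* in, int* out, int n, int stride){
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if(i < n){
            out[i] = in[(i * stride) % n];
        }
    }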
The cache hierarchy may simply be working well for you. For a second run, try tuning the 16 KB L1 / 48 KB shared-memory split. You can also disable the L1 cache and compare the results. – pQB
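On a Fermi-class device (cc 2.x), the L1/shared split pQB mentions is selected through the runtime API, and L1 caching of global loads is turned off with a compiler flag; a sketch of both (standard calls and flag, applied here to the poster's kernel):

    // Per kernel: prefer 48 KB shared / 16 KB L1 for computeAddShared
    // (cudaFuncCachePreferL1 flips it to 16 KB shared / 48 KB L1).
    cudaFuncSetCacheConfig(computeAddShared, cudaFuncCachePreferShared);
    // Or device-wide:
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    // Disabling L1 for global loads is a compile-time switch:
    //     nvcc -Xptxas -dlcm=cg file.cu
    // after which global loads are cached in L2 only.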
@pQB:如何禁用L1緩存 – Programmer