Using the CUDA thread index as a number

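I am new to CUDA and GPGPU. I would like to check a large set of numbers (greater than 32 bits) for a property, and I would like to try doing that on my Windows 7 64-bit machine with an NVIDIA GTX 1080: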

Detected 1 CUDA Capable device(s) 

Device 0: "GeForce GTX 1080" 
    CUDA Driver Version/Runtime Version   8.0/8.0 
    CUDA Capability Major/Minor version number: 6.1 
    Total amount of global memory:     8192 MBytes (8589934592 bytes) 
    (20) Multiprocessors, (128) CUDA Cores/MP:  2560 CUDA Cores 
    GPU Max Clock rate:       1734 MHz (1.73 GHz) 
    Memory Clock rate:        5005 Mhz 
    Memory Bus Width:        256-bit 
    L2 Cache Size:         2097152 bytes 
    Maximum Texture Dimension Size (x,y,z)   1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) 
    Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers 
    Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 65536 
    Warp size:          32 
    Maximum number of threads per multiprocessor: 2048 
    Maximum number of threads per block:   1024 
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and kernel execution:   Yes with 2 copy engine(s) 
    Run time limit on kernels:      Yes 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support:      Disabled 
    CUDA Device Driver Mode (TCC or WDDM):   WDDM (Windows Display Driver Model) 
    Device supports Unified Addressing (UVA):  Yes 
    Device PCI Domain ID/Bus ID/location ID: 0/1/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

When I run the code below, the value of "sum" is nonsense (28, 20, etc.), even though I can see threadId going from 0 to 4095:

#include <cuda.h> 
#include <cuda_runtime.h> 
#include "device_launch_parameters.h" 
#include "stdio.h" 

__global__ void Simple(unsigned long long int *sum) 
{ 
    unsigned long long int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z; 

    unsigned long long int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) 
     + (threadIdx.z * (blockDim.x * blockDim.y)) 
     + (threadIdx.y * blockDim.x) 
     + threadIdx.x; 

    printf("threadId = %llu.\n", threadId); 
    // Check threadId for property. Possibly introduce a grid stride for loop to give each thread a range to check. 
    sum[0]++; 
} 

int main(int argc, char **argv) 
{ 
    unsigned long long int sum[] = { 0 }; 

    unsigned long long int *dev_sum; 

    cudaMalloc((void**)&dev_sum, sizeof(unsigned long long int)); 
    cudaMemcpy(dev_sum, sum, sizeof(unsigned long long int), cudaMemcpyHostToDevice); 

    dim3 grid(2, 1, 1); 
    dim3 block(1024, 1, 1); 

    printf("--------- Start kernel ---------\n\n"); 
    Simple <<< grid, block >>> (dev_sum); 
    cudaDeviceSynchronize(); 

    cudaMemcpy(sum, dev_sum, sizeof(unsigned long long int), cudaMemcpyDeviceToHost); 

    printf("sum = %llu.\n", sum[0]); 

    cudaFree(dev_sum); 

    getchar(); 

    return 0; 
}

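How would I modify this kernel launch so that the maximum number of threads (for my setup) works over a range of numbers (say 0 to 10^12), by adding a grid-stride loop?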

dim3 grid(2, 1, 1); 
dim3 block(1024, 1, 1); 

Simple <<< grid, block >>> (dev_sum);

2017-02-23 munga

Replace 'sum[0]++' with 'atomicAdd(&sum[0], 1)'. – tera

Your increment has a race condition. – OutOfBound

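Thanks. That helps. Could you answer the second half of the question, about setting the maximum number of threads to process a large 1D dataset? – munga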

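All threads increment the same location in memory, which causes a race condition: each "sum[0]++" is a separate read, add and write, so concurrent threads overwrite each other's updates. That is why the result is incorrect. You should use an atomic add to get it right (CUDA provides a function for this, atomicAdd).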
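A minimal sketch of the atomic version follows, assuming the same launch shape as in the question (2 blocks of 1024 threads). The names `CountThreads` and `countWithAtomics` are made up for illustration and do not come from the original post; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread adds 1 to the same counter. atomicAdd performs the
// read-modify-write as one indivisible operation, so no update is lost.
__global__ void CountThreads(unsigned long long int *sum)
{
    atomicAdd(sum, 1ULL);
}

// Hypothetical helper: launches the kernel and returns the final counter.
unsigned long long int countWithAtomics(unsigned int blocks, unsigned int threads)
{
    unsigned long long int sum = 0;
    unsigned long long int *dev_sum = nullptr;

    cudaMalloc((void**)&dev_sum, sizeof(sum));
    cudaMemcpy(dev_sum, &sum, sizeof(sum), cudaMemcpyHostToDevice);

    CountThreads<<<blocks, threads>>>(dev_sum);
    cudaDeviceSynchronize();

    cudaMemcpy(&sum, dev_sum, sizeof(sum), cudaMemcpyDeviceToHost);
    cudaFree(dev_sum);
    return sum;
}

int main()
{
    // Same launch shape as the question: 2 blocks of 1024 threads.
    printf("sum = %llu\n", countWithAtomics(2, 1024));
    return 0;
}
```

With this launch, one increment per thread gives 2 x 1024 = 2048. Atomics on a single address serialize the updates, so for heavier kernels it is usually better to accumulate in a per-thread local variable and issue one atomicAdd per thread at the end.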

2017-02-23 09:47:46 Matso

Thanks. That helps. What about the second half of the question, setting the maximum number of threads to process a large 1D dataset? – munga
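The thread does not answer that second half. One common approach is a grid-stride loop: launch roughly as many blocks as the device can keep resident, and let each thread step through the range in strides of the total thread count. The sketch below is only an illustration under stated assumptions: `hasProperty` is a placeholder test (divisibility by 3) standing in for whatever property is actually being checked, `CountInRange` and `countInRange` are hypothetical names, and the block size of 256 is a guess rather than a tuned value. Error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef unsigned long long int u64;

// Placeholder property (assumption): n is divisible by 3.
__host__ __device__ bool hasProperty(u64 n)
{
    return n % 3ULL == 0ULL;
}

// Grid-stride loop: each thread checks begin + id, begin + id + stride, ...
__global__ void CountInRange(u64 begin, u64 end, u64 *count)
{
    u64 id     = blockIdx.x * (u64)blockDim.x + threadIdx.x;
    u64 stride = (u64)gridDim.x * blockDim.x;

    u64 local = 0;  // per-thread tally, kept in a register
    for (u64 n = begin + id; n < end; n += stride)
        if (hasProperty(n))
            ++local;

    // One atomic per thread instead of one per number checked.
    if (local > 0)
        atomicAdd(count, local);
}

// Hypothetical helper: counts numbers in [begin, end) that have the property.
u64 countInRange(u64 begin, u64 end)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Size the grid from the device limits: enough blocks to fill every SM.
    const int threads     = 256;
    const int blocksPerSM = prop.maxThreadsPerMultiProcessor / threads;
    const int blocks      = prop.multiProcessorCount * blocksPerSM;

    u64 count = 0;
    u64 *dev_count = nullptr;
    cudaMalloc((void**)&dev_count, sizeof(count));
    cudaMemcpy(dev_count, &count, sizeof(count), cudaMemcpyHostToDevice);

    CountInRange<<<blocks, threads>>>(begin, end, dev_count);
    cudaDeviceSynchronize();

    cudaMemcpy(&count, dev_count, sizeof(count), cudaMemcpyDeviceToHost);
    cudaFree(dev_count);
    return count;
}

int main()
{
    // A small range for a quick run; the question's range is far larger.
    printf("count = %llu\n", countInRange(0ULL, 1ULL << 24));
    return 0;
}
```

On the GTX 1080 listed above (20 multiprocessors, 2048 threads per multiprocessor) this sizing works out to 160 blocks of 256 threads, although register usage may allow fewer resident blocks in practice. Because the device reports a run time limit on kernels under WDDM, a single launch over 0 to 10^12 would likely be stopped by the display watchdog; calling the helper repeatedly on smaller sub-ranges and summing the results on the host is one way around that.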