
I'm trying to get some code to run on the GPU using OpenMP, but I'm not succeeding. In my code I perform a matrix multiplication with for loops: once marked with an OpenMP pragma, and once without. (This lets me compare the execution times.) After the first loop I call omp_get_num_devices() (this is my main test of whether I'm actually connected to a GPU). No matter what I try, omp_get_num_devices() always returns 0. How do I use the GPU with OpenMP?

The machine I'm using has two NVIDIA Tesla K40M GPUs. CUDA 7.0 and CUDA 7.5 are available on the machine as modules, with the CUDA 7.5 module normally active. gcc 4.9.3, 5.1.0, and 7.1.0 are all available as modules, with the gcc 7.1.0 module normally active. I'm compiling my code with $ g++ -fopenmp -omptargets=nvptx64sm_35-nvidia-linux ParallelExperimenting.cpp -o ParallelExperimenting. I have successfully parallelized OpenMP code on the CPU, but not on the GPU.

My main goal is to get omp_get_num_devices() to return 2 as proof that I can detect and use the GPUs with OpenMP. Any help here would be greatly appreciated.

Here is the code I'm using to check whether the GPU is being used correctly:

#include <omp.h> 
#include <fstream> 
#include <stdio.h> 
#include <math.h> 
#include <stdlib.h> 
#include <time.h> 
#include <iomanip> 
#include <cstdio> 
#include <iostream> 
using namespace std; 

double A [501][501]; 
double B [501][501]; 
double C [501][501][501]; 
double D [501][501]; 
double E [501][501]; 
double F [501][501][501]; 
double dummyvar; 
int Mapped [501]; 

int main() { 
    int i, j, k, l, N, StallerGPU, StallerCPU; 

    // Matrix dimension 
    N = 500; 

    // Variables merely used to make the execution take longer and to 
    // exaggerate the difference in performance between the first and 
    // second calculation 
    StallerGPU = 200; 
    StallerCPU = 200; 

    std::cout << " N = " << N << "\n"; 
    // generate matrix to be used in first calculation 
    for (i=0; i<N; i++) { 
     for (k=0; k<N; k++) { 
      if (i == k) { 
       A[i][k] = i+1; 
      } else { 
       A[i][k] = i * k/N; 
      } 
     } 
    } 
    // generate other matrix to be used for the first calculation 
    for (k=0; k<N; k++) { 
     for (j=0; j<N; j++) { 
      B[k][j] = 2*(N-1)-k-j; 
     } 
    } 

// Slightly adjusted matrices for second calculation 
    for (i=0; i<N; i++) { 
     for (k=0; k<N; k++) { 
      if (i == k) { 
       D[i][k] = i+2; 
      } else { 
       D[i][k] = i * k/N - 1; 
      } 
     } 
    } 

    for (k=0; k<N; k++) { 
     for (j=0; j<N; j++) { 
      E[k][j] = 2*(N+1)-k-j; 
     } 
    } 

    dummyvar = 0; 

    //Run the multiplication in parallel using GPUs 

    double diff; 
    time_t time1; 
    time1 = time(NULL); // CPU time counter 
    cout << endl << " GPU section begins at " << ctime(&time1) << endl; 

     // This pragma is frequently changed to try different tags 
     #pragma omp for collapse(4) private(i, j, k, l) 

     for (i=0; i<N; i++) { 
//   Mapped[i] = omp_is_initial_device(); 
      for (j=0; j<N; j++) { 
       for (k=0; k<N; k++) { 
        for(l = 0; l < StallerGPU; l++) { 
         C[i][j][k] = A[i][k] * B[k][j] ; 
         dummyvar += A[i][k] * B[k][j] * (l + 1); 
        } 
       } 
//   cout << " i " << i << endl; 
      } 
     } 


    //record the time it took to run the multiplication  
    time_t time2 = time(NULL); 
    cout << " number of devices: " << omp_get_num_devices() << endl; 
    cout << " dummy variable: " << dummyvar << endl; 

    float cpumin = difftime(time2,time1); 
    diff = difftime(time2,time1); 
    cout << " stopping at delta GPU time: " << cpumin << endl; 
    cout << " terminating at " << ctime(&time2) << endl; 
    cout << " GPU time elapsed " << diff << " s" << endl; 
    cout << endl; 

    dummyvar = 0; 
    time_t time3 = time(NULL); 
    cout << endl << " CPU section begins at " << ctime(&time3) << endl; 
// #pragma omp single 
    for (i=0; i<N; i++) { 
     for (j=0; j<N; j++) { 
      for (k=0; k<N; k++) { 
       for (int l=0; l<StallerCPU; l++) { 
        F[i][j][k] = D[i][k] * E[k][j]; 
        dummyvar += D[i][k] * E[k][j] * (l - 1); 
       } 
      } 
     } 
    } 
    // the sum to complete the matrix calculation is left out here, but would 
    // only be used to check if the result of the calculation is correct 

    time_t time4 = time(NULL); 
    cpumin = difftime(time4,time3); 
    diff = difftime(time4,time3); 
    cout << " dummy variable: " << dummyvar << endl; 
    cout << " stopping at delta CPU time: " << cpumin << endl; 
    cout << " terminating at " << ctime(&time4) << endl; 
    cout << " CPU time elapsed " << diff << " s" << endl; 
    //Compare the time it took to confirm that we actually used GPUs to parallelize. 
} 

Here are the results from running the deviceQuery CUDA sample code:

./deviceQuery Starting... 

CUDA Device Query (Runtime API) version (CUDART static linking) 

Detected 2 CUDA Capable device(s) 

Device 0: "Tesla K40m" 
    CUDA Driver Version/Runtime Version   7.5/7.5 
    CUDA Capability Major/Minor version number: 3.5 
    Total amount of global memory:     11520 MBytes (12079136768 bytes) 
    (15) Multiprocessors, (192) CUDA Cores/MP:  2880 CUDA Cores 
    GPU Max Clock rate:       745 MHz (0.75 GHz) 
    Memory Clock rate:        3004 Mhz 
    Memory Bus Width:        384-bit 
    L2 Cache Size:         1572864 bytes 
    Maximum Texture Dimension Size (x,y,z)   1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) 
    Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers 
    Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 65536 
    Warp size:          32 
    Maximum number of threads per multiprocessor: 2048 
    Maximum number of threads per block:   1024 
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and kernel execution:   Yes with 2 copy engine(s) 
    Run time limit on kernels:      No 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support:      Enabled 
    Device supports Unified Addressing (UVA):  Yes 
    Device PCI Domain ID/Bus ID/location ID: 0/130/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 

Device 1: "Tesla K40m" 
    CUDA Driver Version/Runtime Version   7.5/7.5 
    CUDA Capability Major/Minor version number: 3.5 
    Total amount of global memory:     11520 MBytes (12079136768 bytes) 
    (15) Multiprocessors, (192) CUDA Cores/MP:  2880 CUDA Cores 
    GPU Max Clock rate:       745 MHz (0.75 GHz) 
    Memory Clock rate:        3004 Mhz 
    Memory Bus Width:        384-bit 
    L2 Cache Size:         1572864 bytes 
    Maximum Texture Dimension Size (x,y,z)   1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) 
    Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers 
    Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 65536 
    Warp size:          32 
    Maximum number of threads per multiprocessor: 2048 
    Maximum number of threads per block:   1024 
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and kernel execution:   Yes with 2 copy engine(s) 
    Run time limit on kernels:      No 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support:      Enabled 
    Device supports Unified Addressing (UVA):  Yes 
    Device PCI Domain ID/Bus ID/location ID: 0/131/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 
> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : Yes 
> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : Yes 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K40m 
Result = PASS 

Can you upload a minimal working example that shows what you're trying to do? – Richard


Welcome to Stack Overflow! Your post is unfortunately missing a [mcve]. Please visit the [help center](http://stackoverflow.com/help) and read the [How to ask a good question](http://stackoverflow.com/help/how-to-ask) section. –


I've added my test code. – Josiah

Answers


GCC 4.9.3 and 5.1.0 definitely do not support OpenMP offloading to GPUs. GCC 7.1.0 does support it, but it has to be built with special configure options as described here.
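
Once a GCC build with offloading support is active, the compile line uses GCC's -foffload flag rather than the -omptargets flag shown in the question (a sketch, assuming the offload target was configured as nvptx-none, per GCC's offloading documentation):

g++ -fopenmp -foffload=nvptx-none ParallelExperimenting.cpp -o ParallelExperimenting 

Running g++ -v should then list nvptx-none among the configured offload targets (OFFLOAD_TARGET_NAMES); if it does not, the compiler was built without offload support and omp_get_num_devices() will keep returning 0.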


This solved my problem! Thank you so much!!! – Josiah


Maybe I'm headed in the wrong direction, but I'd like to help, because I was once stuck in a strange situation while using a GPU.

You need to be in the "video" group on Linux in order to use the GPU.

Otherwise, all results returned from the GPU will be 0.

So I would suggest you run the CUDA sample code to check whether you're in the same situation I was stuck in before.

It's strange, and I'm not sure I've described it correctly. Hope it helps.


According to this: https://wiki.gentoo.org/wiki/NVidia/nvidia-drivers

Any user(s) needing access to the video card will need to be added to the video group.
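
As a sketch, on most Linux distributions the user can be added to that group as follows (the group change takes effect after logging out and back in):

sudo usermod -a -G video $USER 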


Why don't you provide a link as a reference? This sounds like a good answer, and it would sound even better. – Thecave3


http://support.amd.com/en-us/kb-articles/Pages/AMDGPU-PRO-Install.aspx –


I'll add the results of running the deviceQuery CUDA sample code. – Josiah


I may be wrong, but I think the posted code needs a few corrections (maybe you already know this). To actually run on the GPU target with OpenMP, you need to replace:

#pragma omp for collapse(4) private(i, j, k, l) 

with

#pragma omp target teams distribute parallel for collapse(4) private(i, j, k, l) 

You can verify whether the kernel actually runs on the GPU by profiling your executable with 'nvprof'; it should show the kernels executing on the GPU. You can also change the number of teams and threads in the target region with the 'num_teams' and 'thread_limit' clauses, and you should see corresponding changes in your profile.
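
For example, a profiling run might look like this (a sketch; nvprof ships with the CUDA toolkit, and the exact kernel names shown depend on the compiler):

nvprof ./ParallelExperimenting 

If offloading works, the "GPU activities" section of the output lists the offloaded kernels; if nothing ran on the GPU, nvprof instead warns that no CUDA application was profiled.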

To programmatically check whether the target region actually runs on the target device, I use the 'omp_is_initial_device()' call, which returns 0 when called from the accelerator. Here is an example:

int A[1] = {-1}; 
#pragma omp target 
{ 
    A[0] = omp_is_initial_device(); 
} 

if (!A[0]) { 
    printf("Able to use offloading!\n"); 
} 
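
A self-contained variant sketch that combines both checks (one assumption worth noting: under OpenMP 4.5 a scalar is firstprivate in a target region by default, so an explicit map(tofrom:) clause is needed to copy the result back; the one-element array above is implicitly mapped tofrom, which is why it works without one):

#include <omp.h> 
#include <cstdio> 

int main() { 
    int on_host = 1; 
    // map(tofrom:) copies the scalar back from the device when the region ends 
    #pragma omp target map(tofrom: on_host) 
    { 
        on_host = omp_is_initial_device(); 
    } 
    printf("devices: %d, target region ran on the %s\n", 
           omp_get_num_devices(), on_host ? "host (fallback)" : "device"); 
    return 0; 
} 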

I tried profiling it with 'nvprof' as you suggested. After the program finishes executing, I get an error '======== Warning: No CUDA application was profiled, exiting'. When I add 'omp_is_initial_device()', it returns 1 every time. – Josiah


That seems to strongly suggest that your kernel is running on the CPU. As Ilya mentioned, you may need to build gcc with GPU support. –


Why do you need to use a one-element array rather than just a plain integer? I tried your code and it only works with an array, but I don't understand why. –