
I'm trying to get some code to run on the GPU using OpenMP, but I'm not succeeding. In my code I perform a matrix multiplication with for loops: once marked with an OpenMP pragma, and once without. (This lets me compare the execution times.) After the first loop I call omp_get_num_devices() (this is my main test of whether I'm actually connected to a GPU). No matter what I try, omp_get_num_devices() always returns 0. How do I use the GPU with OpenMP?

The machine I'm using has two NVIDIA Tesla K40M GPUs. CUDA 7.0 and CUDA 7.5 are available on the machine as modules, with the CUDA 7.5 module normally active. gcc 4.9.3, 5.1.0, and 7.1.0 are all available as modules, with the gcc 7.1.0 module normally active. I'm compiling my code with $ g++ -fopenmp -omptargets=nvptx64sm_35-nvidia-linux ParallelExperimenting.cpp -o ParallelExperimenting. I have successfully parallelized OpenMP code on the CPU, but not on the GPU.

My main goal is to get omp_get_num_devices() to return 2 as proof that I can detect and use the GPUs with OpenMP. Any help here would be greatly appreciated.

Here is the code I'm using to check whether the GPU is being used correctly:

#include <omp.h> 
#include <fstream> 
#include <stdio.h> 
#include <math.h> 
#include <stdlib.h> 
#include <time.h> 
#include <iomanip> 
#include <cstdio> 
#include <iostream> 
using namespace std; 

double A [501][501]; 
double B [501][501]; 
double C [501][501][501]; 
double D [501][501]; 
double E [501][501]; 
double F [501][501][501]; 
double dummyvar; 
int Mapped [501]; 

int main() { 
    int i, j, k, l, N, StallerGPU, StallerCPU; 

    // Matrix dimension 
    N = 500; 

    // Variables merely used to make the execution take longer and to 
    // exaggerate the difference in performance between the first and 
    // second calculation 
    StallerGPU = 200; 
    StallerCPU = 200; 

    std::cout << " N = " << N << "\n"; 
    // generate matrix to be used in first calculation 
    for (i=0; i<N; i++) { 
     for (k=0; k<N; k++) { 
      if (i == k) { 
       A[i][k] = i+1; 
      } else { 
       A[i][k] = i * k/N; 
      } 
     } 
    } 
    // generate other matrix to be used for the first calculation 
    for (k=0; k<N; k++) { 
     for (j=0; j<N; j++) { 
      B[k][j] = 2*(N-1)-k-j; 
     } 
    } 

// Slightly adjusted matrices for second calculation 
    for (i=0; i<N; i++) { 
     for (k=0; k<N; k++) { 
      if (i == k) { 
       D[i][k] = i+2; 
      } else { 
       D[i][k] = i * k/N - 1; 
      } 
     } 
    } 

    for (k=0; k<N; k++) { 
     for (j=0; j<N; j++) { 
      E[k][j] = 2*(N+1)-k-j; 
     } 
    } 

    dummyvar = 0; 

    //Run the multiplication in parallel using GPUs 

    double diff; 
    time_t time1; 
    time1 = time(NULL); // CPU time counter 
    cout << endl << " GPU section begins at " << ctime(&time1) << endl; 

     // This pragma is frequently changed to try different tags 
     #pragma omp for collapse(4) private(i, j, k, l) 

     for (i=0; i<N; i++) { 
//   Mapped[i] = omp_is_initial_device(); 
      for (j=0; j<N; j++) { 
       for (k=0; k<N; k++) { 
        for(l = 0; l < StallerGPU; l++) { 
         C[i][j][k] = A[i][k] * B[k][j] ; 
         dummyvar += A[i][k] * B[k][j] * (l + 1); 
        } 
       } 
//   cout << " i " << i << endl; 
      } 
     } 


    //record the time it took to run the multiplication  
    time_t time2 = time(NULL); 
    cout << " number of devices: " << omp_get_num_devices() << endl; 
    cout << " dummy variable: " << dummyvar << endl; 

    float cpumin = difftime(time2,time1); 
    diff = difftime(time2,time1); 
    cout << " stopping at delta GPU time: " << cpumin << endl; 
    cout << " terminating at " << ctime(&time2) << endl; 
    cout << " GPU time elapsed " << diff << " s" << endl; 
    cout << endl; 

    dummyvar = 0; 
    time_t time3 = time(NULL); 
    cout << endl << " CPU section begins at " << ctime(&time3) << endl; 
// #pragma omp single 
    for (i=0; i<N; i++) { 
     for (j=0; j<N; j++) { 
      for (k=0; k<N; k++) { 
       for (int l=0; l<StallerCPU; l++) { 
        F[i][j][k] = D[i][k] * E[k][j]; 
        dummyvar += D[i][k] * E[k][j] * (l - 1); 
       } 
      } 
     } 
    } 
    // the sum to complete the matrix calculation is left out here, but would 
    // only be used to check if the result of the calculation is correct 

    time_t time4 = time(NULL); 
    cpumin = difftime(time4,time3); 
    diff = difftime(time4,time3); 
    cout << " dummy variable: " << dummyvar << endl; 
    cout << " stopping at delta CPU time: " << cpumin << endl; 
    cout << " terminating at " << ctime(&time4) << endl; 
    cout << " CPU time elapsed " << diff << " s" << endl; 
    //Compare the time it took to confirm that we actually used GPUs to parallelize. 
} 

Here are the results from running the deviceQuery CUDA sample code:

./deviceQuery Starting... 

CUDA Device Query (Runtime API) version (CUDART static linking) 

Detected 2 CUDA Capable device(s) 

Device 0: "Tesla K40m" 
    CUDA Driver Version/Runtime Version   7.5/7.5 
    CUDA Capability Major/Minor version number: 3.5 
    Total amount of global memory:     11520 MBytes (12079136768 bytes) 
    (15) Multiprocessors, (192) CUDA Cores/MP:  2880 CUDA Cores 
    GPU Max Clock rate:       745 MHz (0.75 GHz) 
    Memory Clock rate:        3004 Mhz 
    Memory Bus Width:        384-bit 
    L2 Cache Size:         1572864 bytes 
    Maximum Texture Dimension Size (x,y,z)   1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) 
    Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers 
    Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 65536 
    Warp size:          32 
    Maximum number of threads per multiprocessor: 2048 
    Maximum number of threads per block:   1024 
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and kernel execution:   Yes with 2 copy engine(s) 
    Run time limit on kernels:      No 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support:      Enabled 
    Device supports Unified Addressing (UVA):  Yes 
    Device PCI Domain ID/Bus ID/location ID: 0/130/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 

Device 1: "Tesla K40m" 
    CUDA Driver Version/Runtime Version   7.5/7.5 
    CUDA Capability Major/Minor version number: 3.5 
    Total amount of global memory:     11520 MBytes (12079136768 bytes) 
    (15) Multiprocessors, (192) CUDA Cores/MP:  2880 CUDA Cores 
    GPU Max Clock rate:       745 MHz (0.75 GHz) 
    Memory Clock rate:        3004 Mhz 
    Memory Bus Width:        384-bit 
    L2 Cache Size:         1572864 bytes 
    Maximum Texture Dimension Size (x,y,z)   1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) 
    Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers 
    Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 65536 
    Warp size:          32 
    Maximum number of threads per multiprocessor: 2048 
    Maximum number of threads per block:   1024 
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and kernel execution:   Yes with 2 copy engine(s) 
    Run time limit on kernels:      No 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support:      Enabled 
    Device supports Unified Addressing (UVA):  Yes 
    Device PCI Domain ID/Bus ID/location ID: 0/131/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 
> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : Yes 
> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : Yes 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K40m 
Result = PASS 

Can you upload a minimal working example that shows what you're trying to do? – Richard


Welcome to Stack Overflow! Your post is unfortunately missing a [mcve]. Please visit the [help center](http://stackoverflow.com/help) and read the [How to ask a good question](http://stackoverflow.com/help/how-to-ask) section. –


I've added my test code. – Josiah

Answers


GCC 4.9.3 and 5.1.0 definitely do not support OpenMP offloading to GPUs. GCC 7.1.0 does support it, but it has to be built with special configure options as described here.
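
Once a GCC build with offloading support is active, the compile line uses GCC's -foffload flag rather than the -omptargets flag shown in the question (a sketch, assuming the offload target was configured as nvptx-none, per GCC's offloading documentation):

g++ -fopenmp -foffload=nvptx-none ParallelExperimenting.cpp -o ParallelExperimenting 

Running g++ -v should then list nvptx-none among the configured offload targets (OFFLOAD_TARGET_NAMES); if it does not, the compiler was built without offload support and omp_get_num_devices() will keep returning 0.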


This solved my problem! Thank you so much!!! – Josiah


Maybe I'm headed in the wrong direction, but I'd like to help, because I was once stuck in a strange situation while using a GPU.

You need to be in the "video" group on Linux in order to use the GPU.

Otherwise, all results returned from the GPU will be 0.

So I would suggest you run the CUDA sample code to check whether you're in the same situation I was stuck in before.

It's strange, and I'm not sure I've described it correctly. Hope it helps.


According to this: https://wiki.gentoo.org/wiki/NVidia/nvidia-drivers

Any user(s) needing access to the video card will need to be added to the video group.
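
As a sketch, on most Linux distributions the user can be added to that group as follows (the group change takes effect after logging out and back in):

sudo usermod -a -G video $USER 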


Why don't you provide a link as a reference? This sounds like a good answer, and it would sound even better. – Thecave3


http://support.amd.com/en-us/kb-articles/Pages/AMDGPU-PRO-Install.aspx –


I'll add the results of running the deviceQuery CUDA sample code. – Josiah


I may be wrong, but I think the posted code needs a few corrections (maybe you already know this). To actually run on the GPU target with OpenMP, you need to replace:

#pragma omp for collapse(4) private(i, j, k, l) 

with

#pragma omp target teams distribute parallel for collapse(4) private(i, j, k, l) 

You can verify whether the kernel actually runs on the GPU by profiling your executable with 'nvprof'; it should show the kernels executing on the GPU. You can also change the number of teams and threads in the target region with the 'num_teams' and 'thread_limit' clauses, and you should see corresponding changes in your profile.
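
For example, a profiling run might look like this (a sketch; nvprof ships with the CUDA toolkit, and the exact kernel names shown depend on the compiler):

nvprof ./ParallelExperimenting 

If offloading works, the "GPU activities" section of the output lists the offloaded kernels; if nothing ran on the GPU, nvprof instead warns that no CUDA application was profiled.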

To programmatically check whether the target region actually runs on the target device, I use the 'omp_is_initial_device()' call, which returns 0 when called from the accelerator. Here is an example:

int A[1] = {-1}; 
#pragma omp target 
{ 
    A[0] = omp_is_initial_device(); 
} 

if (!A[0]) { 
    printf("Able to use offloading!\n"); 
} 
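
A self-contained variant sketch that combines both checks (one assumption worth noting: under OpenMP 4.5 a scalar is firstprivate in a target region by default, so an explicit map(tofrom:) clause is needed to copy the result back; the one-element array above is implicitly mapped tofrom, which is why it works without one):

#include <omp.h> 
#include <cstdio> 

int main() { 
    int on_host = 1; 
    // map(tofrom:) copies the scalar back from the device when the region ends 
    #pragma omp target map(tofrom: on_host) 
    { 
        on_host = omp_is_initial_device(); 
    } 
    printf("devices: %d, target region ran on the %s\n", 
           omp_get_num_devices(), on_host ? "host (fallback)" : "device"); 
    return 0; 
} 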

I tried profiling it with 'nvprof' as you suggested. After the program finishes executing, I get an error '======== Warning: No CUDA application was profiled, exiting'. When I add 'omp_is_initial_device()', it returns 1 every time. – Josiah


That seems to strongly suggest that your kernel is running on the CPU. As Ilya mentioned, you may need to build gcc with GPU support. –


Why do you need to use a one-element array rather than just a plain integer? I tried your code and it only works with an array, but I don't understand why. –