Low host-to-device transfer rate with NVIDIA Quadro M4000


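I am doing OpenCL on an NVIDIA Quadro M4000 installed in a PCIe 3 x16 slot. The card's documentation states that the CPU->GPU transfer rate can reach 15.7 Gb/s, whereas in my benchmark it only reaches ~2.4 Gb/s. I know that the effective transfer rate can differ considerably from the theoretical one, but I did not expect the gap to be this large.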

Does anyone have experience with CPU->GPU data transfers on a Quadro?

Thanks

#include<iostream> 
#include<cstdlib> 
#include<cstdio> 
#include<cstring> // strlen 
#include<string> 
#include<cmath> 
#include<CL/cl.h> 
#include <Windows.h> 

using namespace std; 

SYSTEMTIME last_call; 

cl_platform_id platform_id = NULL; 
cl_uint ret_num_platform; 
cl_device_id device_id = NULL; 
cl_uint ret_num_device; 
cl_context context = NULL; 
cl_command_queue command_queue = NULL; 
cl_program program = NULL; 
cl_kernel kernel = NULL; 
cl_int err; 

void _profile(const char* msg){ 
SYSTEMTIME tmp; 

clFinish(command_queue); 

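GetSystemTime(&tmp); 
// Convert both timestamps to milliseconds since midnight so the difference is a single, non-wrapping value 
long now_ms = ((tmp.wHour * 60L + tmp.wMinute) * 60L + tmp.wSecond) * 1000L + tmp.wMilliseconds; 
long last_ms = ((last_call.wHour * 60L + last_call.wMinute) * 60L + last_call.wSecond) * 1000L + last_call.wMilliseconds; 
printf("__Profile --- %s --- : %ld ms\n", msg, now_ms - last_ms); 
    last_call = tmp; 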
} 

int main() 
{ 

// Reading Kernel Program 
const char *kernel_src_std = "__kernel void copy(__global const float *x, __global float *z){\
         const int id = get_global_id(0);\
         z[id] = x[id]; \
         }";
size_t kernel_src_size = strlen(kernel_src_std); 

// Create Input data 
int w = 1920; 
int h = 1080; 
int c = 3; 

float* input = (float*)malloc(w * h * c * sizeof(float)); 
for(int i=0;i<w*h*c;i++) 
    input[i] = (float)rand()/RAND_MAX; 


// getting platform ID 
err = clGetPlatformIDs(1, &platform_id, &ret_num_platform); 

// Get Device ID 
err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_device); 

// Create Context 
context = clCreateContext(NULL,1,&device_id,NULL,NULL,&err); 

// Create Command Queue 
command_queue = clCreateCommandQueue(context, device_id, 0, &err); 

// Create buffer Object 
cl_mem buf_in = clCreateBuffer(context,CL_MEM_READ_ONLY, sizeof(float) * w*h*c, 
    0, &err); 

cl_mem buf_out = clCreateBuffer(context,CL_MEM_WRITE_ONLY, sizeof(float) * w*h*c, 
    0, &err); 

_profile("Start transfer input..."); 

// Copy Data from Host to Device 
cl_event event[5]; 

err = clEnqueueWriteBuffer(command_queue,buf_in,CL_TRUE, 0, sizeof(float)*w*h*c,input,0,NULL, NULL); 

_profile("End transfer input..."); 

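// Create and Build Program 
program = clCreateProgramWithSource(context, 1, (const char **)&kernel_src_std, 0, &err); 
err = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL); // the program must be built before clCreateKernel 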

// Create Kernel 
kernel = clCreateKernel(program,"copy",&err); 

// Set Kernel Arguments 

err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&buf_in); 

err = clSetKernelArg(kernel, 1,sizeof(cl_mem), (void *)&buf_out); 

// Execute Kernel 
size_t ws[]={(size_t)h*w*c}; // explicit cast avoids a narrowing conversion in the braced initializer 
size_t lws[]={100}; 
err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, ws, lws, 0, NULL, NULL); 

// Create output buf 
float* output = (float*)malloc(sizeof(float)*w*h*c); 

// Read output Data, from Device to Host 
err = clEnqueueReadBuffer(command_queue, buf_out, CL_TRUE, 0, sizeof(float)*w*h*c, output,0,NULL,NULL); 


//Release Objects 

clReleaseMemObject(buf_in); 
clReleaseMemObject(buf_out); 
clReleaseKernel(kernel); 
clReleaseProgram(program); 
clReleaseCommandQueue(command_queue); 
clReleaseContext(context); 
free(input); 
free(output); 

getchar(); // wait for a key press so the console window stays open 

return(0); 
}

Source

2016-03-07 lity

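It may depend on how you transfer the data. How are you doing it? Is it just one huge array that you transfer to the GPU? Also: have you compared your results with the benchmark given in the CUDA samples? – CygnusX1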

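Yes, I am transferring an array of 6220800 floats. Could that be the cause? I can't find the OpenCL samples in my installation; it seems NVIDIA no longer maintains them. I use clEnqueueWriteBuffer for the transfer. – lity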

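@lity This question appears to be about OpenCL but is also tagged [cuda]. I suggest removing that tag to avoid confusion. Make sure the card is plugged into a PCIe x16 slot, not a PCIe x4 slot. For transfer sizes >= 16 MB, the maximum practical transfer rate of PCIe gen3 x16 is about 11-12 GB/sec. You need to use "pinned memory" on the host to get the highest transfer speed, but I am not sure whether OpenCL supports that. – njuffa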

As your question is vague, it is hard to pin down the exact reason for the poor performance. Some concrete code might help.

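However, in your comment you say that you transfer an array of 6220800 floats. That is roughly 200 Mbit (about 24.9 MB) to transfer. At the maximum transfer rate (15.7 Gb/s) that should take about 12 ms.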
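The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the thread: the helper names `payload_bytes` and `ideal_transfer_ms` are made up for illustration, and the rate is passed in bytes per second so that either reading of the quoted figure (gigabits or gigabytes per second) can be tried.

```cpp
#include <cassert>
#include <cstddef>

// Payload size in bytes for `count` elements of `elem_size` bytes each.
// (Hypothetical helper, named for this example only.)
double payload_bytes(std::size_t count, std::size_t elem_size) {
    return static_cast<double>(count) * static_cast<double>(elem_size);
}

// Ideal transfer time in milliseconds at a sustained rate given in bytes/s.
// Ignores latency and any other per-request overhead.
double ideal_transfer_ms(double bytes, double rate_bytes_per_s) {
    assert(rate_bytes_per_s > 0.0);
    return bytes / rate_bytes_per_s * 1e3;
}
```

For example, `ideal_transfer_ms(payload_bytes(6220800, sizeof(float)), 15.7e9 / 8.0)` treats 15.7 as gigabits per second and gives roughly 12.7 ms, while passing `15.7e9` (gigabytes per second) gives roughly 1.6 ms.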

However, every new transfer request also adds a latency, which, for small transfers, can effectively reduce your transfer rate.
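A simple way to picture this effect is to model each request as a fixed latency plus the time spent moving bytes at the peak rate. The sketch below assumes that model; the function name `effective_rate` and any latency value plugged into it are illustrative assumptions, not measured properties of the card or driver.

```cpp
#include <cassert>

// Effective rate (bytes/s) of a single transfer of `bytes` bytes when every
// request pays a fixed `latency_s` seconds on top of the time needed at
// `peak_bytes_per_s`. Simplified model: latency is assumed constant.
double effective_rate(double bytes, double latency_s, double peak_bytes_per_s) {
    assert(bytes > 0.0 && latency_s >= 0.0 && peak_bytes_per_s > 0.0);
    const double total_s = latency_s + bytes / peak_bytes_per_s;
    return bytes / total_s;
}
```

Under an assumed latency of 10 microseconds and a peak of 10 GB/s, a 100 kB transfer reaches only about half of the peak, because the latency is as long as the copy itself. The same assumed latency costs a 24.9 MB transfer less than 1% of a 15.7 GB/s peak, which is why trying a much larger array is a useful check.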

Have you tried benchmarking a much larger array (say, 100x the size)?

Source

2016-03-07 14:04:34 CygnusX1

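Sorry, I made a typo: the transfer rate is actually 15.7 GB/s (not Gb/s), so the expected transfer time should be ~1.7 ms. I tried a larger array (100x) and still get the same speed. This is the piece of code I use for the transfer: 'result = clEnqueueWriteBuffer(cl.command_queue, kernel.kernel_inputs[i].data, CL_TRUE, 0, sizeof(float) * kernel.kernel_inputs[i].size, kernel.inputs[i].data, 0, NULL, NULL);' Thanks – lity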

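You can't measure transfer speed with blocking calls and a timer on the CPU side. You should use clEvents. This way of measuring is bogus and can give you any result, depending on the latency between the CPU and GPU when the callback is triggered. – DarkZeros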

I think you can. In the NVIDIA OpenCL Best Practices Guide, using a CPU timer around blocking OpenCL calls is one of the two proposed profiling methods (chapter 2.1). – lity

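You are using blocking transfers, which means you incur a stall on each read/write request (in addition, you are not using pinned memory, but that has already been addressed). At the moment, your code goes

start timing -> write -> stall -> kernel -> read -> stall -> end timing. If your transfer is on the order of 2 ms, this will greatly affect your measured memory bandwidth, because the stalls are comparable in magnitude to the transfer itself. If you want to measure bandwidth precisely, you need to eliminate these stalls.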
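One way to keep host-side stalls out of the number is to time the transfer on the device with OpenCL event profiling: create the queue with `CL_QUEUE_PROFILING_ENABLE`, pass a `cl_event` to the transfer call, and read `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` through `clGetEventProfilingInfo`, which reports nanosecond timestamps. The sketch below covers only the last step, turning two such timestamps into a bandwidth figure. The helper name `bandwidth_gb_per_s` is hypothetical, and the OpenCL calls themselves are left out so the snippet compiles without an OpenCL SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Convert a byte count and two timestamps in nanoseconds (for instance the
// start/end values of a profiled OpenCL event) into GB/s, with 1 GB = 1e9 bytes.
// (Hypothetical helper, named for this example only.)
double bandwidth_gb_per_s(std::size_t bytes, std::uint64_t start_ns, std::uint64_t end_ns) {
    assert(end_ns > start_ns);
    const double seconds = static_cast<double>(end_ns - start_ns) * 1e-9;
    return static_cast<double>(bytes) / seconds / 1e9;
}
```

As a sanity check on the units, 24883200 bytes moved in 10368000 ns comes out at about 2.4 GB/s.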

Source

2016-03-12 18:09:33
