Copying an array from RAM to the GPU and from the GPU back to RAM

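I am trying to introduce some CUDA optimizations into one of my projects, but I think I am doing something wrong. I want to implement a simple matrix-vector multiplication (result = matrix * vector). When I try to copy the result back to the host, an error occurs (cudaErrorLaunchFailure). Is there an error in my kernel (matrixVectorMultiplicationKernel), or am I calling cudaMemcpy incorrectly? I have found no useful documentation for this error status. I think it completely corrupts the state of the GPU, because once it has occurred I cannot call any CUDA kernel without getting this error again.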

Edit #1: updated the code, following leftaroundabout's suggestions.

// code 
... 
Eigen::MatrixXf matrix(M, N); // matrix.data() usually should return a float array 
Eigen::VectorXf vector(N); // same here for vector.data() 
Eigen::VectorXf result(M); 
... // fill matrix and vector 
float* matrixOnDevice = copyMatrixToDevice(matrix.data(), matrix.rows(), matrix.cols()); 
matrixVectorMultiplication(matrixOnDevice, vector.data(), result.data(), matrix.rows(), matrix.cols()); 
... // clean up 

// helper functions 
float* copyMatrixToDevice(const float* matrix, int mRows, int mCols) 
{ 
    float* matrixOnDevice; 
    const int length = mRows*mCols; 
    const int size = length * sizeof(float); 
    handleCUDAError(cudaMalloc((void**)&matrixOnDevice, size)); 
    handleCUDAError(cudaMemcpy(matrixOnDevice, matrix, size, cudaMemcpyHostToDevice)); 
    return matrixOnDevice; 
} 

void matrixVectorMultiplication(const float* matrixOnDevice, const float* vector, float* result, int mRows, int mCols) 
{ 
    const int vectorSize = mCols*sizeof(float); 
    const int resultSize = mRows*sizeof(float); 
    const int matrixLength = mRows*mCols; 
    float* deviceVector; 
    float* deviceResult; 
    handleCUDAError(cudaMalloc((void**)&deviceVector, vectorSize)); 
    handleCUDAError(cudaMalloc((void**)&deviceResult, resultSize)); 
    handleCUDAError(cudaMemset(deviceResult, 0, resultSize)); 
    handleCUDAError(cudaMemcpy(deviceVector, vector, vectorSize, cudaMemcpyHostToDevice)); 
    int threadsPerBlock = 256; 
    int blocksPerGrid = (mRows + threadsPerBlock - 1)/threadsPerBlock; 
    matrixVectorMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(matrixOnDevice, vector, result, mRows, mCols, matrixLength); 
    // --- no errors yet --- 
    handleCUDAError(cudaMemcpy(result, deviceResult, resultSize, cudaMemcpyDeviceToHost)); // cudaErrorLaunchFailure 
    handleCUDAError(cudaFree(deviceVector)); // cudaErrorLaunchFailure 
    handleCUDAError(cudaFree(deviceResult)); // cudaErrorLaunchFailure 
} 

__global__ void matrixVectorMultiplicationKernel(const float* matrix, const float* vector, float* result, int mRows, int mCols, int length) 
{ 
    int row = blockDim.x * blockIdx.x + threadIdx.x; 
    if(row < mRows) 
    { 
        for(int col = 0, mIdx = row*mCols; col < mCols; col++, mIdx++) 
            result[row] += matrix[mIdx] * vector[col]; 
    } 
}

Source

2012-04-16 alfa

It would be reasonable to use CUBLAS rather than writing a kernel like this yourself. – leftaroundabout 2012-04-16 16:40:15

I think I will do that soon. But CUBLAS seems quite complicated, and I wanted to start with something simple. – alfa 2012-04-16 16:55:03

In my opinion, CUBLAS is simpler (but also more restrictive). – 2012-04-18 08:16:08
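To illustrate the CUBLAS suggestion, here is a minimal sketch of the same product using cublasSgemv. The wrapper name matVecCublas and its parameter names are assumptions made for this example, not code from the thread. It expects all three pointers to be device pointers and the matrix to be stored column-major, which is what cuBLAS assumes and also Eigen's default layout.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical wrapper (not from the thread): computes
// deviceResult = deviceMatrix * deviceVector with cuBLAS.
// deviceMatrix is mRows x mCols, column-major, and already on the device.
// deviceVector (mCols floats) and deviceResult (mRows floats) are device pointers too.
// Returns true if cuBLAS accepted the call.
bool matVecCublas(cublasHandle_t handle,
                  const float* deviceMatrix, const float* deviceVector,
                  float* deviceResult, int mRows, int mCols)
{
    // y = alpha * A * x + beta * y; with beta = 0 the old content of y is ignored.
    const float alpha = 1.0f;
    const float beta = 0.0f;
    cublasStatus_t status = cublasSgemv(handle, CUBLAS_OP_N,
                                        mRows, mCols,
                                        &alpha,
                                        deviceMatrix, mRows, // leading dimension = number of rows
                                        deviceVector, 1,
                                        &beta,
                                        deviceResult, 1);
    return status == CUBLAS_STATUS_SUCCESS;
}
```

The handle comes from cublasCreate and is released with cublasDestroy. Host data still has to be copied to the device with cudaMemcpy before the call, and the result copied back afterwards.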

Your problem is that void copyMatrixToDevice(..., float* matrixOnDevice, ...) takes this pointer by value, i.e. it cannot "output" the device matrix. You can do that with void copyMatrixToDevice(..., float** matrixOnDevice, ...), called via

copyMatrixToDevice(matrix.data(), &matrixOnDevice, matrix.rows(), matrix.cols());

The same problem exists with result in matrixVectorMultiplication.
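The by-value issue can be shown in isolation. The two function names below are made up for this sketch. The first variant deliberately reproduces the mistake (and leaks the allocation), while the second uses the pointer-to-pointer form described above.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Broken on purpose: devicePtr is a copy of the caller's pointer, so the
// address cudaMalloc writes into it is lost on return (the memory leaks).
cudaError_t allocateByValue(float* devicePtr, std::size_t count)
{
    return cudaMalloc((void**)&devicePtr, count * sizeof(float));
}

// Works: the caller passes the address of its own pointer, so cudaMalloc
// stores the device address in the caller's variable.
cudaError_t allocateByPointer(float** devicePtr, std::size_t count)
{
    return cudaMalloc((void**)devicePtr, count * sizeof(float));
}
```

After allocateByValue(p, n) the caller's p is unchanged, whereas after allocateByPointer(&p, n) it holds a device address that can be passed to cudaMemcpy and cudaFree. Returning the pointer from the function, as the updated code in the question does, is an equally valid way out.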

In the long run, in C++, you should put a proper class abstraction layer around all of this.
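One possible shape for such an abstraction is a small RAII class that owns a device allocation. The class name DeviceBuffer and its interface are assumptions made for this sketch, not something proposed in the thread. It requires C++11 and handles only float arrays.

```cuda
#include <cstddef>
#include <stdexcept>
#include <cuda_runtime.h>

// Hypothetical RAII wrapper (not from the thread): owns `count` floats on the device.
class DeviceBuffer
{
public:
    explicit DeviceBuffer(std::size_t count) : data_(nullptr), count_(count)
    {
        if (cudaMalloc((void**)&data_, count_ * sizeof(float)) != cudaSuccess)
            throw std::runtime_error("cudaMalloc failed");
    }

    // The destructor releases the device memory, so no explicit cudaFree is needed.
    ~DeviceBuffer() { cudaFree(data_); }

    // Copying is disabled so that two objects never free the same allocation.
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    // Copies `count` floats from host memory to the device.
    void upload(const float* host)
    {
        if (cudaMemcpy(data_, host, count_ * sizeof(float), cudaMemcpyHostToDevice) != cudaSuccess)
            throw std::runtime_error("upload failed");
    }

    // Copies `count` floats from the device back to host memory.
    void download(float* host) const
    {
        if (cudaMemcpy(host, data_, count_ * sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess)
            throw std::runtime_error("download failed");
    }

    // Device pointer, suitable as a kernel argument.
    float* get() { return data_; }
    const float* get() const { return data_; }
    std::size_t size() const { return count_; }

private:
    float* data_;
    std::size_t count_;
};
```

With a type like this, a host pointer such as vector.data() and a device pointer no longer share the same plain float* type at the call site, which makes mixing them up harder.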

Source

2012-04-16 16:42:54 leftaroundabout

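OK, I really should have found the first error ('**matrixOnDevice') myself. Thanks! That is also why I have to pass a (void**) to cudaMalloc. The second suggestion is not clear to me, though. cudaMemcpy does not change the address of 'result'. Why is it not enough to pass it as a float*? In any case, the error is still there. This did not completely solve the problem. – alfa 2012-04-16 17:23:08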

Right, I did not look at 'matrixVectorMultiplication' properly. That one does work, but you are not being particularly consistent. – leftaroundabout 2012-04-16 17:25:28

OK, I have now found the last error: I should have called the kernel with the addresses that live on the device... 'matrixVectorMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(matrixOnDevice, deviceVector, deviceResult, mRows, mCols, matrixLength);' – alfa 2012-04-16 18:01:01
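Putting the fix together, a minimal sketch of the whole round trip could look like the following. The names checkCuda, matVecKernel and matVec are hypothetical and stand in for the question's handleCUDAError and functions, whose full definitions are not shown in the thread. The kernel receives only device pointers, and errors are checked right after the launch. Kernel launches are asynchronous, so a fault inside the kernel is otherwise first reported by a later call such as cudaMemcpy. The sketch assumes row-major storage, matching the indexing in the question's kernel. Eigen::MatrixXf is column-major by default, so its data would have to be laid out accordingly.

```cuda
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): abort with a message on any CUDA error.
static void checkCuda(cudaError_t status, const char* what)
{
    if (status != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(status));
        std::exit(EXIT_FAILURE);
    }
}

// One thread per row; assumes the matrix is stored row-major.
__global__ void matVecKernel(const float* matrix, const float* vector,
                             float* result, int mRows, int mCols)
{
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    if (row < mRows) {
        float sum = 0.0f; // accumulate in a register, write the result once
        for (int col = 0; col < mCols; ++col)
            sum += matrix[row * mCols + col] * vector[col];
        result[row] = sum;
    }
}

// hostMatrix, hostVector and hostResult all point to host memory.
void matVec(const float* hostMatrix, const float* hostVector,
            float* hostResult, int mRows, int mCols)
{
    const std::size_t matrixSize = sizeof(float) * mRows * mCols;
    const std::size_t vectorSize = sizeof(float) * mCols;
    const std::size_t resultSize = sizeof(float) * mRows;

    float* deviceMatrix = nullptr;
    float* deviceVector = nullptr;
    float* deviceResult = nullptr;
    checkCuda(cudaMalloc((void**)&deviceMatrix, matrixSize), "cudaMalloc matrix");
    checkCuda(cudaMalloc((void**)&deviceVector, vectorSize), "cudaMalloc vector");
    checkCuda(cudaMalloc((void**)&deviceResult, resultSize), "cudaMalloc result");
    checkCuda(cudaMemcpy(deviceMatrix, hostMatrix, matrixSize, cudaMemcpyHostToDevice), "copy matrix");
    checkCuda(cudaMemcpy(deviceVector, hostVector, vectorSize, cudaMemcpyHostToDevice), "copy vector");

    const int threadsPerBlock = 256;
    const int blocksPerGrid = (mRows + threadsPerBlock - 1) / threadsPerBlock;
    // Only device pointers are passed to the kernel.
    matVecKernel<<<blocksPerGrid, threadsPerBlock>>>(deviceMatrix, deviceVector,
                                                     deviceResult, mRows, mCols);
    checkCuda(cudaGetLastError(), "kernel launch");         // invalid launch configuration
    checkCuda(cudaDeviceSynchronize(), "kernel execution"); // faults inside the kernel

    checkCuda(cudaMemcpy(hostResult, deviceResult, resultSize, cudaMemcpyDeviceToHost), "copy result");
    checkCuda(cudaFree(deviceMatrix), "cudaFree matrix");
    checkCuda(cudaFree(deviceVector), "cudaFree vector");
    checkCuda(cudaFree(deviceResult), "cudaFree result");
}
```

Because the kernel writes each row's sum exactly once, the cudaMemset on the result buffer from the question is not needed in this variant.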
