I am trying to implement an FIR (finite impulse response) filter in CUDA, as a 1D convolution. My approach is quite simple and looks somewhat like this:
#include <cuda_runtime.h>

__global__ void filterData(const float *d_data,
                           const float *d_numerator,
                           float *d_filteredData,
                           const int numeratorLength,
                           const int filteredDataLength)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Guard both the computation and the store, so that threads past the end
    // of the output do not write out of bounds.
    if (i < filteredDataLength)
    {
        float sum = 0.0f;
        for (int j = 0; j < numeratorLength; j++)
        {
            // The first (numeratorLength-1) elements contain the filter state
            sum += d_numerator[j] * d_data[i + numeratorLength - j - 1];
        }
        d_filteredData[i] = sum;
    }
}

int main(void)
{
    // (Skipping error checks to make code more readable)
    int dataLength = 18042;
    int filteredDataLength = 16384;
    int numeratorLength = 1659;

    // Pointers to data, filtered data and filter coefficients
    // (Skipping how these are read into the arrays)
    float *h_data = new float[dataLength];
    float *h_filteredData = new float[filteredDataLength];
    float *h_numerator = new float[numeratorLength];

    // Create device pointers
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, dataLength * sizeof(float));
    float *d_numerator = nullptr;
    cudaMalloc((void **)&d_numerator, numeratorLength * sizeof(float));
    float *d_filteredData = nullptr;
    cudaMalloc((void **)&d_filteredData, filteredDataLength * sizeof(float));

    // Copy data to device
    cudaMemcpy(d_data, h_data, dataLength * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_numerator, h_numerator, numeratorLength * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (filteredDataLength + threadsPerBlock - 1) / threadsPerBlock;
    filterData<<<blocksPerGrid, threadsPerBlock>>>(d_data, d_numerator, d_filteredData, numeratorLength, filteredDataLength);

    // Copy results to host
    cudaMemcpy(h_filteredData, d_filteredData, filteredDataLength * sizeof(float), cudaMemcpyDeviceToHost);

    // Clean up
    cudaFree(d_data);
    cudaFree(d_numerator);
    cudaFree(d_filteredData);

    // Do stuff with h_filteredData...

    // Clean up some more
    delete [] h_data;
    delete [] h_filteredData;
    delete [] h_numerator;
}
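The filter works, but I am new to CUDA programming and I am not sure how to go about optimizing it.
A slight problem I see is that dataLength, filteredDataLength, and numeratorLength are not known beforehand in the application I intend to use the filter in. Also, even though filteredDataLength is a multiple of 32 in the code above, that is not guaranteed in the final application.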
When I compare the code above with ArrayFire, my code takes about three times longer to execute.
Does anyone have any ideas on how to speed things up?
EDIT: Changed all occurrences of filterLength to numeratorLength.
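Is 'numeratorLength' the same as 'filterLength'? I don't see a definition of 'numeratorLength' in what you have posted. This is essentially a 1D stencil problem. The standard optimization for stencil problems is to load a portion of the input data into shared memory, enough for the threads of a block to compute their outputs, and then have those threads work out of the shared memory copy. – 2013-04-06 19:09:03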
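The shared-memory suggestion above could look roughly like the following. This is only a minimal sketch, not a tuned implementation: the names filterDataShared and runFilterShared are hypothetical, error checks are skipped as in the question, and it assumes that blockDim.x + numeratorLength - 1 floats fit in one block's shared memory (true for the sizes in the question: 256 + 1659 - 1 floats is under 8 KB).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical tiled variant of filterData: each block first stages the input
// samples it needs in shared memory, then every thread reads from that copy.
__global__ void filterDataShared(const float *d_data,
                                 const float *d_numerator,
                                 float *d_filteredData,
                                 const int numeratorLength,
                                 const int filteredDataLength)
{
    extern __shared__ float tile[];  // blockDim.x + numeratorLength - 1 floats

    const int blockStart = blockDim.x * blockIdx.x;
    const int tileLength = blockDim.x + numeratorLength - 1;
    const int dataLength = filteredDataLength + numeratorLength - 1;

    // Cooperative load: consecutive threads read consecutive input samples.
    for (int k = threadIdx.x; k < tileLength; k += blockDim.x)
    {
        const int src = blockStart + k;
        tile[k] = (src < dataLength) ? d_data[src] : 0.0f;
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    const int i = blockStart + threadIdx.x;
    if (i < filteredDataLength)
    {
        float sum = 0.0f;
        for (int j = 0; j < numeratorLength; j++)
        {
            // Same indexing as the original kernel, relative to the tile start
            sum += d_numerator[j] * tile[threadIdx.x + numeratorLength - j - 1];
        }
        d_filteredData[i] = sum;
    }
}

// Hypothetical host helper. data must hold
// filteredDataLength + numeratorLength - 1 samples, as in the question.
std::vector<float> runFilterShared(const std::vector<float> &data,
                                   const std::vector<float> &numerator)
{
    const int numeratorLength = static_cast<int>(numerator.size());
    const int filteredDataLength = static_cast<int>(data.size()) - numeratorLength + 1;
    std::vector<float> filtered(filteredDataLength);

    float *d_data = nullptr, *d_numerator = nullptr, *d_filteredData = nullptr;
    cudaMalloc((void **)&d_data, data.size() * sizeof(float));
    cudaMalloc((void **)&d_numerator, numerator.size() * sizeof(float));
    cudaMalloc((void **)&d_filteredData, filtered.size() * sizeof(float));
    cudaMemcpy(d_data, data.data(), data.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_numerator, numerator.data(), numerator.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;
    const int blocksPerGrid = (filteredDataLength + threadsPerBlock - 1) / threadsPerBlock;
    // Dynamic shared memory size is passed as the third launch parameter.
    const size_t sharedBytes = (threadsPerBlock + numeratorLength - 1) * sizeof(float);
    filterDataShared<<<blocksPerGrid, threadsPerBlock, sharedBytes>>>(
        d_data, d_numerator, d_filteredData, numeratorLength, filteredDataLength);

    cudaMemcpy(filtered.data(), d_filteredData, filtered.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_numerator);
    cudaFree(d_filteredData);
    return filtered;
}
```

The bounds checks in the load loop and before the store mean the lengths do not have to be multiples of the block size. Whether this actually beats the global-memory version depends on the GPU and its caches, so it is worth timing both.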
If you end up beating ArrayFire, let us know! If not, you are always free to use ArrayFire, since it is faster :) – arrayfire 2013-04-07 03:06:18
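@RobertCrovella Yes, numeratorLength is the same as filterLength. I decided to change the name but apparently missed a few places. My bad, sorry. I have edited the original post so that only numeratorLength appears. Thanks for the suggestion to use shared memory. I have read that it is much faster than global memory, but I am somewhat unsure how best to implement this, since shared memory is limited in size and the filter could be quite long. I will play around with it and see how it goes. – Elfendahl 2013-04-08 03:44:28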
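One possible way around the shared memory limit raised here is to walk through the filter in fixed-size chunks of taps, so the tile size depends on the chunk length rather than on numeratorLength. The sketch below is an assumption about how that could be done, not something proposed in the thread: filterDataChunked, runFilterChunked and the chunkLength parameter are hypothetical names, and error checks are again skipped.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical chunked variant: the filter is processed chunkLength taps at a
// time, so shared memory use is blockDim.x + chunkLength - 1 floats no matter
// how long the filter is.
__global__ void filterDataChunked(const float *d_data,
                                  const float *d_numerator,
                                  float *d_filteredData,
                                  const int numeratorLength,
                                  const int filteredDataLength,
                                  const int chunkLength)
{
    extern __shared__ float tile[];  // blockDim.x + chunkLength - 1 floats

    const int blockStart = blockDim.x * blockIdx.x;
    const int dataLength = filteredDataLength + numeratorLength - 1;
    const int i = blockStart + threadIdx.x;
    float sum = 0.0f;

    for (int j0 = 0; j0 < numeratorLength; j0 += chunkLength)
    {
        const int len = min(chunkLength, numeratorLength - j0);
        // First input sample needed by this block for taps j0 .. j0 + len - 1
        const int windowStart = blockStart + numeratorLength - j0 - len;
        const int windowLength = blockDim.x + len - 1;

        for (int k = threadIdx.x; k < windowLength; k += blockDim.x)
        {
            const int src = windowStart + k;
            tile[k] = (src < dataLength) ? d_data[src] : 0.0f;
        }
        __syncthreads();  // tile fully loaded

        if (i < filteredDataLength)
        {
            for (int jj = 0; jj < len; jj++)
            {
                sum += d_numerator[j0 + jj] * tile[threadIdx.x + len - 1 - jj];
            }
        }
        __syncthreads();  // everyone is done reading before the next overwrite
    }

    if (i < filteredDataLength)
    {
        d_filteredData[i] = sum;
    }
}

// Hypothetical host helper; data holds filteredDataLength + numeratorLength - 1 samples.
std::vector<float> runFilterChunked(const std::vector<float> &data,
                                    const std::vector<float> &numerator,
                                    const int chunkLength)
{
    const int numeratorLength = static_cast<int>(numerator.size());
    const int filteredDataLength = static_cast<int>(data.size()) - numeratorLength + 1;
    std::vector<float> filtered(filteredDataLength);

    float *d_data = nullptr, *d_numerator = nullptr, *d_filteredData = nullptr;
    cudaMalloc((void **)&d_data, data.size() * sizeof(float));
    cudaMalloc((void **)&d_numerator, numerator.size() * sizeof(float));
    cudaMalloc((void **)&d_filteredData, filtered.size() * sizeof(float));
    cudaMemcpy(d_data, data.data(), data.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_numerator, numerator.data(), numerator.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;
    const int blocksPerGrid = (filteredDataLength + threadsPerBlock - 1) / threadsPerBlock;
    const size_t sharedBytes = (threadsPerBlock + chunkLength - 1) * sizeof(float);
    filterDataChunked<<<blocksPerGrid, threadsPerBlock, sharedBytes>>>(
        d_data, d_numerator, d_filteredData, numeratorLength, filteredDataLength, chunkLength);

    cudaMemcpy(filtered.data(), d_filteredData, filtered.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_numerator);
    cudaFree(d_filteredData);
    return filtered;
}
```

A call such as runFilterChunked(data, numerator, 1024) would keep the tile at 256 + 1024 - 1 floats per block. The chunk length is a tuning knob: larger chunks mean fewer synchronization points but more shared memory per block, so the best value has to be found by measurement.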