Fast CUDA Thrust custom comparison operator

I'm evaluating CUDA and am currently using the Thrust library to sort numbers.

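I'd like to write my own comparator for thrust::sort, but doing so slows the sort down drastically! I created my own less implementation simply by copying the code from functional.h. Even so, it seems to be compiled differently and runs very slowly.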

1. Default comparator: thrust::less() - ms
2. My own comparator: less() - ms

I'm using Visual Studio 2010. What should I do to get the same performance as option 1?

Complete code:

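#include <stdio.h>
#include <stdlib.h>   // rand, srand
#include <time.h>     // time, clock

#include <cuda.h>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/generate.h>
#include <thrust/sort.h>

// Returns a pseudo-random int, reseeding the generator every 10000 calls.
int myRand()
{
    static int counter = 0;
    if (counter++ % 10000 == 0)
        srand((unsigned)time(NULL) + counter);
    return (rand() << 16) | rand();
}

// User-defined comparator, copied from Thrust's functional.h.
template<typename T>
struct less : public thrust::binary_function<T,T,bool>
{
    __host__ __device__ bool operator()(const T &lhs, const T &rhs) const {
        return lhs < rhs;
    }
};

int main()
{
    // Fill 10 million ints on the host, then copy them to the device.
    thrust::host_vector<int> h_vec(10 * 1000 * 1000);
    thrust::generate(h_vec.begin(), h_vec.end(), myRand);

    thrust::device_vector<int> d_vec = h_vec;

    // Time the device sort with the user-defined comparator.
    clock_t clc = clock();
    thrust::sort(d_vec.begin(), d_vec.end(), less<int>());
    printf("%ldms\n", (long)((clock() - clc) * 1000 / CLOCKS_PER_SEC));

    return 0;
}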

2012-01-27 Anton Burtsev

Curious whether you've tried ArrayFire's sort function. It might be useful for your analysis. – arrayfire 2012-01-28 01:54:51

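The reason you're observing a performance difference is that Thrust implements the sort with different algorithms depending on the arguments passed to thrust::sort.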

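In case 1, Thrust can prove that the sort can be implemented in linear time with a radix sort. This is because the type of the data being sorted is a built-in numeric type (int) and the comparison function is the built-in less-than operation: Thrust recognizes that thrust::less<int> produces the same result as x < y.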
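To illustrate why a radix sort needs no comparator at all, here is a minimal CPU-side sketch of a least-significant-digit radix sort over 32-bit signed keys. It is only an illustration of the technique, not Thrust's GPU implementation; the names radix_sort, digit and demo are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Extracts the 8-bit digit at `shift`. The sign bit is flipped first so that
// negative keys order before positive ones when treated as unsigned.
inline unsigned digit(std::int32_t key, int shift)
{
    const std::uint32_t u = static_cast<std::uint32_t>(key) ^ 0x80000000u;
    return (u >> shift) & 0xFFu;
}

// LSD radix sort: four stable counting passes, one per byte of the key.
// The keys are never compared with each other; only their bits are read.
void radix_sort(std::vector<std::int32_t>& keys)
{
    std::vector<std::int32_t> buffer(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> offsets{};  // zero-initialised
        // Count how many keys fall into each of the 256 buckets.
        for (std::int32_t k : keys)
            ++offsets[digit(k, shift) + 1];
        // A prefix sum turns the counts into starting offsets.
        for (std::size_t i = 0; i < 256; ++i)
            offsets[i + 1] += offsets[i];
        // Stable scatter into the buffer, then make it the current input.
        for (std::int32_t k : keys)
            buffer[offsets[digit(k, shift)]++] = k;
        keys.swap(buffer);
    }
}

// Small usage example.
void demo()
{
    std::vector<std::int32_t> v = {170, -45, 75, -90, 802, 24, 2, 66};
    radix_sort(v);
    assert(std::is_sorted(v.begin(), v.end()));
}
```

Each pass touches every key a constant number of times, so the total work grows linearly with the number of keys. The price is that the algorithm depends on the bit layout of the key type, which is why it cannot accept an arbitrary comparison function.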

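In case 2, Thrust knows nothing about your user-provided less<int> and has to use a more conservative algorithm based on a comparison sort, which has a different asymptotic complexity, even though your less<int> is in fact equivalent to thrust::less<int>.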

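In general, user-defined comparison operators can't be used with the more restrictive, faster sorts that manipulate the binary representation of the data, such as radix sort. In these cases, Thrust falls back on a more general but slower sort.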
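The selection described above can be sketched with a type trait. This is a simplified host-only illustration, assuming std::less as a stand-in for thrust::less; the names can_use_radix_sort, my_less, Algorithm and choose_algorithm are hypothetical and are not Thrust's actual internals.

```cpp
#include <functional>
#include <type_traits>

// Hypothetical trait: by default, nothing is known about the comparator,
// so the radix path is not allowed.
template <typename T, typename Compare>
struct can_use_radix_sort : std::false_type {};

// Only the library's own less-than functor on an arithmetic type is
// recognised as equivalent to the built-in `<`.
template <typename T>
struct can_use_radix_sort<T, std::less<T> >
    : std::integral_constant<bool, std::is_arithmetic<T>::value> {};

// A user-defined comparator with the same body as std::less.
template <typename T>
struct my_less {
    bool operator()(const T& lhs, const T& rhs) const { return lhs < rhs; }
};

enum class Algorithm { Radix, Comparison };

// Picks the algorithm from the comparator's type alone; the comparator's
// body is never inspected.
template <typename T, typename Compare>
constexpr Algorithm choose_algorithm(Compare)
{
    return can_use_radix_sort<T, Compare>::value ? Algorithm::Radix
                                                 : Algorithm::Comparison;
}

static_assert(choose_algorithm<int>(std::less<int>()) == Algorithm::Radix,
              "the recognised functor selects the radix path");
static_assert(choose_algorithm<int>(my_less<int>()) == Algorithm::Comparison,
              "an identical user-defined functor falls back");
```

Because the decision is made on the type, a functor that merely behaves like the built-in one does not qualify. In the question's code, that suggests passing thrust::less<int>() (or no comparator at all) whenever plain ascending order on int is what is wanted.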

2012-01-27 22:13:07
