Thrust: sort_by_key performance with zip_iterator

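I am using sort_by_key with the values passed through a zip_iterator. This sort_by_key is called many times, and after a certain number of iterations it becomes 10x slower! What is the cause of this drop in performance?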

Symptom

I am sorting 3 vectors using sort_by_key, with one of them acting as the key vector:

struct Segment 
{ 
    int v[2]; 
}; 

thrust::device_vector<int> keyVec; 
thrust::device_vector<int> valVec; 
thrust::device_vector<Segment> segVec; 

// ... code which fills these vectors ... 

// Sort by key; the two value vectors are zipped so they are permuted together
thrust::sort_by_key(keyVec.begin(), keyVec.end(),
        thrust::make_zip_iterator(thrust::make_tuple(valVec.begin(), segVec.begin())));

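The vectors typically hold about 4 million elements. In the first 2 calls, sort_by_key takes 0.04 s; in loop 3 it takes 0.1 s, and it then degrades further to 0.3 s for the remaining loops. Thus, we see a 10x degradation in performance.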

Extra information

To make sure that sort_by_key was the only cause of the degradation, I replaced the call above with a manual sort that uses a handwritten kernel:

thrust::device_vector<int> indexVec(keyVec.size());
thrust::sequence(indexVec.begin(), indexVec.end());

// Sort the keys and indexes
thrust::sort_by_key(keyVec.begin(), keyVec.end(), indexVec.begin());

thrust::device_vector<int> valVec2(keyVec.size());
thrust::device_vector<Segment> segVec2(keyVec.size());

// Use index array and move vectors to destination
// (x, y stand for the grid and block dimensions of the launch)
moveKernel<<< x, y >>>(
    toRawPtr(indexVec),
    indexVec.size(),
    toRawPtr(valVec),
    toRawPtr(segVec),
    toRawPtr(valVec2),
    toRawPtr(segVec2));

// Swap back into original vectors
valVec.swap(valVec2);
segVec.swap(segVec2);

This handwritten sort takes 0.03 s, and its performance is consistent across all iterations, unlike the degradation seen with sort_by_key and zip_iterator.
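The question does not show moveKernel or toRawPtr. The following is only a sketch of what they might look like, assuming toRawPtr wraps thrust::raw_pointer_cast and moveKernel is a plain gather through the sorted index array. The kernel body, the 256-thread launch configuration, and the sortWithIndex wrapper are all hypothetical and not taken from the original post.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

struct Segment
{
    int v[2];
};

// Hypothetical helper: raw device pointer to a device_vector's storage.
template <typename T>
T* toRawPtr(thrust::device_vector<T>& vec)
{
    return thrust::raw_pointer_cast(vec.data());
}

// Hypothetical kernel: output element i is read from input position index[i].
__global__ void moveKernel(const int* index, int num,
                           const int* valIn, const Segment* segIn,
                           int* valOut, Segment* segOut)
{
    // Grid-stride loop, so any launch configuration covers all elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num;
         i += blockDim.x * gridDim.x)
    {
        const int src = index[i];
        valOut[i] = valIn[src];
        segOut[i] = segIn[src];
    }
}

// Sorts keyVec and applies the same permutation to valVec and segVec.
void sortWithIndex(thrust::device_vector<int>& keyVec,
                   thrust::device_vector<int>& valVec,
                   thrust::device_vector<Segment>& segVec)
{
    const int num = static_cast<int>(keyVec.size());
    if (num == 0) return;

    // Sort the keys together with their original positions.
    thrust::device_vector<int> indexVec(num);
    thrust::sequence(indexVec.begin(), indexVec.end());
    thrust::sort_by_key(keyVec.begin(), keyVec.end(), indexVec.begin());

    thrust::device_vector<int> valVec2(num);
    thrust::device_vector<Segment> segVec2(num);

    // Assumed launch configuration: 256 threads per block.
    const int threads = 256;
    const int blocks = (num + threads - 1) / threads;
    moveKernel<<<blocks, threads>>>(toRawPtr(indexVec), num,
                                    toRawPtr(valVec), toRawPtr(segVec),
                                    toRawPtr(valVec2), toRawPtr(segVec2));

    // Swap the reordered data back into the original vectors.
    valVec.swap(valVec2);
    segVec.swap(segVec2);
}
```

Calling sortWithIndex(keyVec, valVec, segVec) would then take the place of the sort_by_key call with the zip_iterator. One thing this version makes visible is that it allocates three temporary device vectors on every call, which may be worth keeping in mind when comparing timings across iterations.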

Source

2011-04-22 Ashwin Nanjappa

Is this still an issue with Thrust 1.6? – harrism 2012-09-13 03:43:27

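For accurate timing of each loop, you need to call cudaThreadSynchronize at the end of each loop. The times measured for the first two loops may not be the actual times you are looking for.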
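A minimal sketch of that kind of per-loop timing, assuming a host-side std::chrono timer. It calls cudaDeviceSynchronize, which replaced the now-deprecated cudaThreadSynchronize in later CUDA releases. The helper name timeGpuWork is hypothetical.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical helper: returns the wall-clock seconds spent in work(),
// including any GPU work it launched.
template <typename Work>
double timeGpuWork(Work work)
{
    // Drain earlier GPU work so it is not charged to this measurement.
    cudaDeviceSynchronize();
    const auto start = std::chrono::steady_clock::now();

    work();

    // Wait until everything launched by work() has finished on the device.
    cudaDeviceSynchronize();
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

Inside the loop, the sort_by_key call would be wrapped in a lambda and passed to timeGpuWork once per iteration, so that each reported time covers exactly one sort.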

Source

2011-04-22 19:03:21

Pavan: The timings I noted were taken after calling cudaThreadSynchronize, using the Windows high-resolution timer API. – 2011-04-23 00:55:58
