C++ OpenMP: Writing to a matrix inside a for loop significantly slows down the for loop

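I have the following code. The bitCount function simply counts the number of set bits in a 64-bit integer. The test function is a simplified example of something I am doing in a more complex piece of code. In it I tried to reproduce how writing to a matrix significantly slows down a for loop, and I am trying to figure out why this happens and whether there is a solution.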

#include <cmath>
#include <cstdint>
#include <vector>
#include <omp.h>

// Count the number of set bits
inline int bitCount(uint64_t n){
    int count = 0;
    while(n){
        n &= (n-1); // clears the lowest set bit
        count++;
    }
    return count;
}

void test(){
    int nthreads = omp_get_max_threads();
    omp_set_dynamic(0);
    omp_set_num_threads(nthreads);

    // I need a priority queue per thread
    std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
    std::vector<uint64_t> vals(100,1);

    #pragma omp parallel for shared(mat,vals)
    for(int i = 0; i < 100000000; i++){
        std::vector<double> &tid_vec = mat[omp_get_thread_num()];
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
        }
    }
}

This code runs in about 11 seconds. If I comment out the following line:

tid_vec[j] = total_count;

the code runs in about 2 seconds. Why is writing to the matrix so costly in my case?

Source

2017-02-23 Cauchy

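Depending on your compiler and options, the inner loop reduction may get SIMD-vectorized once the serializing store is removed. – tim18

Without the store, the for loop does not do anything at all. Maybe it gets optimized away? –

If you want a specific answer rather than just guesses, you must provide details about the compiler version, options, hardware, and a [mcve]. Also note that "bitcount" is widely known as "popcnt" and has been optimized to oblivion. – Zulan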

Since you did not mention your compiler or system specs, I assume you are compiling with GCC and the flags -O2 -fopenmp.

If you comment out the line:

tid_vec[j] = total_count;

the compiler will optimize away all computations whose results are not used. Therefore:

total_count += bitCount(vals[j]);

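is optimized away as well. Since the main kernel of your application is then no longer executed, it makes sense that the program runs much faster.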
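One way to test this explanation is to keep the inner loop's result observable without storing into the matrix. The following is only a sketch under assumptions: countWithoutStore and checksum are hypothetical names that do not appear in the question, and an OpenMP reduction is used so that each thread's partial sum feeds a value the caller actually reads. Because the inner loop does not depend on i, an optimizer may still simplify the work, so timings from such a variant should be read with care.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same bit-counting loop as in the question.
inline int bitCount(std::uint64_t n) {
    int count = 0;
    while (n) {
        n &= (n - 1);  // clears the lowest set bit
        ++count;
    }
    return count;
}

// Hypothetical helper (not from the question): runs the kernel without the
// matrix store, but accumulates a checksum so the result is still used.
long long countWithoutStore(const std::vector<std::uint64_t>& vals, int iterations) {
    long long checksum = 0;
    // Without -fopenmp this pragma is ignored and the loop runs serially.
    #pragma omp parallel for reduction(+:checksum)
    for (int i = 0; i < iterations; i++) {
        int total_count = 0;
        for (std::size_t j = 0; j < vals.size(); j++) {
            total_count += bitCount(vals[j]);
        }
        checksum += total_count;  // keeps total_count live
    }
    return checksum;
}
```

If this variant takes roughly as long as the version with the store, the store itself is probably not the bottleneck. If it is still much faster, the difference more likely comes from what the compiler can do once the store is gone.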

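On another note, I would not implement a bit-count function myself but rather rely on functionality that is already provided. For example, the GCC builtin functions include __builtin_popcount, which does exactly what you are trying to do (for a uint64_t argument, the 64-bit variant is __builtin_popcountll).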
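As a minimal sketch of that suggestion, assuming GCC or Clang: the wrapper names bitCountLoop and bitCountBuiltin below are hypothetical, chosen only to show the two versions side by side. __builtin_popcount takes an unsigned int, so a 64-bit value should go through __builtin_popcountll, which takes an unsigned long long.

```cpp
#include <cstdint>

// The hand-written loop from the question, kept for comparison.
inline int bitCountLoop(std::uint64_t n) {
    int count = 0;
    while (n) {
        n &= (n - 1);  // clears the lowest set bit
        ++count;
    }
    return count;
}

// Assumes GCC or Clang. The "ll" variant takes unsigned long long, so no
// upper bits of a 64-bit argument are lost.
inline int bitCountBuiltin(std::uint64_t n) {
    return __builtin_popcountll(n);
}
```

Whether the builtin compiles to a single popcnt instruction depends on the target flags (for example -mpopcnt or -march=native on x86). Without them the compiler falls back to a software routine. In C++20, std::popcount from the <bit> header is a portable alternative.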

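As a bonus: working on private data is better than working on a shared array in which each thread uses different elements. It improves locality (especially important when memory access is non-uniform, i.e. NUMA) and may reduce access contention.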

#pragma omp parallel shared(mat,vals)
{
    // Each thread works on its own private vector
    std::vector<double> local_vec(1000,-INFINITY);
    #pragma omp for
    for(int i = 0; i < 100000000; i++) {
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            local_vec[j] = total_count;
        }
    }
    // Copy the private result back into this thread's row of mat
    mat[omp_get_thread_num()] = local_vec;
}

Source

2017-02-24 10:58:03
