Auto-vectorizing comparisons

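My problem is getting g++ 5.4 to vectorize a comparison. Basically, I want to compare 4 unsigned ints using vector instructions. My first approach was straightforward: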

bool compare(unsigned int const pX[4]) {
    bool c1 = (pX[0] < 1);
    bool c2 = (pX[1] < 2);
    bool c3 = (pX[2] < 3);
    bool c4 = (pX[3] < 4);
    return c1 && c2 && c3 && c4;
}

Compiling with g++ -std=c++11 -Wall -O3 -funroll-loops -march=native -mtune=native -ftree-vectorize -msse -msse2 -ffast-math -fopt-info-vec-missed tells me that it cannot vectorize the comparison because of misaligned data:

main.cpp:5:17: note: not vectorized: failed to find SLP opportunities in basic block. 
main.cpp:5:17: note: misalign = 0 bytes of ref MEM[(const unsigned int *)&x] 
main.cpp:5:17: note: misalign = 4 bytes of ref MEM[(const unsigned int *)&x + 4B] 
main.cpp:5:17: note: misalign = 8 bytes of ref MEM[(const unsigned int *)&x + 8B] 
main.cpp:5:17: note: misalign = 12 bytes of ref MEM[(const unsigned int *)&x + 12B]

So my second attempt was to tell g++ to align the data by copying it into a temporary array:

bool compare(unsigned int const pX[4]) { 
    unsigned int temp[4] __attribute__ ((aligned(16))); 
    temp[0] = pX[0]; 
    temp[1] = pX[1]; 
    temp[2] = pX[2]; 
    temp[3] = pX[3]; 

    bool c1 = (temp[0] < 1); 
    bool c2 = (temp[1] < 2); 
    bool c3 = (temp[2] < 3); 
    bool c4 = (temp[3] < 4); 
    return c1 && c2 && c3 && c4; 
}

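However, the output is the same. My CPU supports AVX2, and the Intel Intrinsics Guide tells me there is _mm256_cmpgt_epi8/16/32/64 for comparisons. Any idea how to tell g++ to use it?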
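One possible way to ask g++ for a vector comparison directly, without waiting for the auto-vectorizer, is the GCC vector extension (`vector_size`), which Clang also accepts. The following is only a sketch under that assumption; the names `v4u`, `v4i` and `compare_vec` are made up for illustration and are not from the question.

```cpp
#include <cstring>

// GCC/Clang vector extension types: four 32-bit lanes in 16 bytes.
// (Hypothetical type names, chosen for this sketch.)
typedef unsigned int v4u __attribute__((vector_size(16)));
typedef int          v4i __attribute__((vector_size(16)));

// Returns true when pX[0] < 1, pX[1] < 2, pX[2] < 3 and pX[3] < 4.
bool compare_vec(unsigned int const pX[4]) {
    v4u x;
    std::memcpy(&x, pX, sizeof x);   // copy in; pX needs no special alignment
    const v4u limits = {1u, 2u, 3u, 4u};
    v4i mask = (x < limits);         // each lane: -1 if true, 0 if false
    // Bitwise AND of all lanes is non-zero only if every lane was true.
    return (mask[0] & mask[1] & mask[2] & mask[3]) != 0;
}
```

Which instructions the compiler picks for the unsigned compare depends on the target flags, so it is worth checking the generated assembly.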

Source

2016-12-06 user1228633

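I don't know if there is a portable way to combine the results, but if you just want to see whether all of the bools are set or not, there are [intrinsics](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) that will tell you whether they are all false via bit counting and the like. [Intel even has an example](https://software.intel.com/en-us/blogs/2013/05/17/processing-arrays-of-bits-with-intel-advanced-vector-extensions-2-intel-avx2) – Mgetz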

There is no 32-bit unsigned compare in SSE/AVX - try it with signed. –

AVX2 needs 32-byte alignment –
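Following the comments above about intrinsics and the missing unsigned compare, here is a rough sketch of how the original 4-element check could be written by hand with SSE2. It assumes an x86 target; flipping the sign bit of both operands is a common trick for getting unsigned ordering out of a signed compare, and `compare_sse2` is a made-up name, not something from the thread.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <climits>

// Returns true when pX[0] < 1, pX[1] < 2, pX[2] < 3 and pX[3] < 4 (unsigned).
bool compare_sse2(unsigned int const pX[4]) {
    // Unaligned load, so pX needs no particular alignment.
    __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pX));
    // _mm_set_epi32 lists lanes from highest to lowest.
    __m128i limits = _mm_set_epi32(4, 3, 2, 1);
    // XOR with 0x80000000 maps unsigned order onto signed order.
    const __m128i bias = _mm_set1_epi32(INT_MIN);
    __m128i lt = _mm_cmplt_epi32(_mm_xor_si128(x, bias),
                                 _mm_xor_si128(limits, bias));
    // movemask gathers the top bit of each of the 16 bytes;
    // 0xFFFF means all four 32-bit lanes compared true.
    return _mm_movemask_epi8(lt) == 0xFFFF;
}
```

The movemask step is one of the "are they all set" intrinsics mentioned in the first comment; an AVX2 version would follow the same pattern with the 256-bit equivalents.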

Okay, apparently the compiler doesn't like "unrolled loops". This works for me:

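bool compare(signed int const pX[8]) {
    // Per-element limits and comparison results, both 32-byte aligned.
    signed int const w[] __attribute__((aligned(32))) = {1,2,3,4,5,6,7,8};
    signed int out[8] __attribute__((aligned(32)));

    // Element-wise comparison written as a loop.
    for (unsigned int i = 0; i < 8; ++i) {
        out[i] = (pX[i] <= w[i]);
    }

    // Combine the results; returns early on the first failed comparison.
    bool temp = true;
    for (unsigned int i = 0; i < 8; ++i) {
        temp = temp && out[i];
        if (!temp) {
            return false;
        }
    }
    return true;
}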

Note that out is also a signed int. Now I just need a fast way to combine the results stored in out.
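One option for combining the values in out is a branch-free reduction: a bitwise AND with no early return, which gives the compiler a simple loop it may be able to vectorize. This is just a sketch, and `all_set` is a hypothetical helper name.

```cpp
// Returns true when every one of the eight entries in out is non-zero.
bool all_set(const int out[8]) {
    int acc = 1;
    for (int i = 0; i < 8; ++i) {
        acc &= (out[i] != 0);  // bitwise AND: no short-circuit, no early exit
    }
    return acc != 0;
}
```

Whether this actually beats the early-return loop for only eight elements would have to be measured.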

Source

2016-12-06 21:02:18 user1228633

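I have also found that unrolled loops are problematic for the compiler. An #omp pragma on the fast index should vectorize, and you may need a reduction for the sum. Another approach is to use a union that represents the 2D [n, m] array as 1D [n * m], which is then naturally easy for the compiler. – Holmz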
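A guess at what the OpenMP suggestion might look like for the 8-element case: the compare and the combine fused into one loop with an `omp simd` reduction. This is a sketch, not necessarily what the commenter had in mind; `compare_omp` is a made-up name, and with GCC the pragma only takes effect when compiling with -fopenmp-simd or -fopenmp (otherwise it is ignored and the loop is still correct).

```cpp
// Returns true when pX[i] <= i + 1 for all eight elements.
bool compare_omp(const int pX[8]) {
    static const int w[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int ok = 1;
    // Ask the compiler to vectorize and to AND-reduce ok across lanes.
    #pragma omp simd reduction(&:ok)
    for (int i = 0; i < 8; ++i) {
        ok &= (pX[i] <= w[i]);
    }
    return ok != 0;
}
```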
