How can I get GCC to auto-vectorize a strided write?

When compiled with GCC 5.2 using -std=c99, -O3, and -mavx2, the following code sample auto-vectorizes (assembly here):

#include <stdint.h> 

void test(uint32_t *restrict a,
          uint32_t *restrict b) {
    uint32_t *a_aligned = __builtin_assume_aligned(a, 32);
    uint32_t *b_aligned = __builtin_assume_aligned(b, 32);

    for (int i = 0; i < (1L << 10); i += 2) {
        a_aligned[i] = 42 * b_aligned[i];
        a_aligned[i+1] = 3 * a_aligned[i+1];
    }
}

However, the following code sample does not auto-vectorize (assembly here):

#include <stdint.h> 

void test(uint32_t *restrict a,
          uint32_t *restrict b) {
    uint32_t *a_aligned = __builtin_assume_aligned(a, 32);
    uint32_t *b_aligned = __builtin_assume_aligned(b, 32);

    for (int i = 0; i < (1L << 10); i += 2) {
        a_aligned[i] = 42 * b_aligned[i];
        a_aligned[i+1] = a_aligned[i+1];
    }
}

The only difference between the two samples is the scaling factor applied to a_aligned[i+1].

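The same holds for GCC 4.8, 4.9, and 5.1. Adding volatile to the declaration of a_aligned disables auto-vectorization entirely. For us, the first sample consistently runs faster than the second, and the speedup is even more pronounced for smaller types (e.g. uint8_t instead of uint32_t).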

Is there a way to get the second code sample to auto-vectorize with GCC?

Source

2015-10-17 T. Wagner

So the only difference is the scaling factor (3 vs. none)? Try explicitly adding 1 as the scaling factor. If that fixes it, it is a compiler bug. – Jeff

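Or try commenting out the 'a_aligned[i+1] = a_aligned[i+1]' statement, or rewriting it as 'a_aligned[i+1] *= 1'. The compiler may not know what to do with your no-op self-assignment, rather than doing exactly what you told it to. –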

@Jeff Indeed, the only difference is the scaling factor. Adding an explicit 1 does not make the second code sample auto-vectorize ([assembly here](https://goo.gl/dnjSaQ)). –
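For reference, the variant discussed in these comments can be written out as the minimal sketch below. The name test_scale1 is made up for illustration. According to the comment above, this form still did not vectorize on the GCC versions tested. One way to see GCC's decision without reading the assembly, assuming a GCC version that supports the -fopt-info family of options, is shown in the comment at the top of the block.

```c
/* Possible check (the -fopt-info-* flags are an addition, the other flags
 * come from the question; scale1.c is a placeholder file name):
 *   gcc -std=c99 -O3 -mavx2 -fopt-info-vec-optimized -c scale1.c
 * GCC prints a note for each loop it vectorizes; -fopt-info-vec-missed
 * reports the loops it gave up on instead.
 */
#include <stdint.h>

void test_scale1(uint32_t *restrict a,
                 uint32_t *restrict b) {
    uint32_t *a_aligned = __builtin_assume_aligned(a, 32);
    uint32_t *b_aligned = __builtin_assume_aligned(b, 32);

    for (int i = 0; i < (1L << 10); i += 2) {
        a_aligned[i] = 42 * b_aligned[i];
        a_aligned[i+1] = 1 * a_aligned[i+1];  /* explicit factor of 1 */
    }
}
```

Semantically this is identical to the second sample: only the even elements of a change.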

The following version vectorizes, but it is ugly if you ask me...

#include <stdint.h> 

void test(uint32_t *a, uint32_t *aa,
          uint32_t *restrict b) {
    #pragma omp simd aligned(a,aa,b:32)
    for (int i = 0; i < (1L << 10); i += 2) {
        a[i] = 2 * b[i];
        a[i+1] = aa[i+1];
    }
}

Compile with -fopenmp and call it as test(a, a, b).
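A call site for this could look like the sketch below. The driver run_test_once, the macro N, and the fill values are hypothetical and only serve the illustration. The buffers use GCC's aligned attribute so that the 32-byte alignment promised in the pragma actually holds.

```c
#include <assert.h>
#include <stdint.h>

#define N (1 << 10)  /* matches the loop bound inside test() */

/* The answer's function, repeated so that this sketch is self-contained. */
void test(uint32_t *a, uint32_t *aa,
          uint32_t *restrict b) {
    #pragma omp simd aligned(a,aa,b:32)
    for (int i = 0; i < (1L << 10); i += 2) {
        a[i] = 2 * b[i];
        a[i+1] = aa[i+1];
    }
}

/* Hypothetical driver: returns 1 if the result matches the scalar semantics. */
int run_test_once(void) {
    static uint32_t a[N] __attribute__((aligned(32)));
    static uint32_t b[N] __attribute__((aligned(32)));

    /* aligned(...:32) is a promise to the compiler, so verify it holds. */
    assert((uintptr_t)a % 32 == 0 && (uintptr_t)b % 32 == 0);

    for (int i = 0; i < N; i++) {
        a[i] = 7;
        b[i] = (uint32_t)i;
    }

    test(a, a, b);  /* a and aa deliberately point to the same buffer */

    for (int i = 0; i < N; i += 2) {
        /* Even elements are overwritten, odd elements keep their value. */
        if (a[i] != 2 * (uint32_t)i || a[i+1] != 7)
            return 0;
    }
    return 1;
}
```

Without -fopenmp the pragma is ignored and the code still computes the same result, only without the vectorization hint.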

Source

2015-10-18 12:59:13 Gilles

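There are two caveats with this approach. The first is that you lose the 'restrict' keyword on 'a' and 'aa', which may cause more of 'aa' to be read back than you need. The sample code provided is fine (in this case, GCC branches to non-vectorized code if 'a' and 'aa' alias in a troublesome way), but in general it can lead to more reads than necessary. For example, consider what happens if 'b' is replaced by 'a'; in that case, many unnecessary extra 'vmovdq' instructions are generated. –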

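The second is that with '-O3', GCC will automatically inline 'test()'. This means that wherever 'test()' is inlined, GCC recognizes that 'a' is 'aa' and fails to auto-vectorize. This can be fixed with GCC's '__attribute__((noinline))', but that still incurs the overhead of a function call. –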
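The workaround for the second caveat might look like the sketch below. It only adds the attribute named in the comment to the answer's function. Whether the loop then vectorizes for a given GCC version has not been verified here.

```c
#include <stdint.h>

/* noinline keeps GCC from inlining test() into its callers, where it
 * could see that a and aa are the same pointer. The cost is a real
 * function call at every call site. */
__attribute__((noinline))
void test(uint32_t *a, uint32_t *aa,
          uint32_t *restrict b) {
    #pragma omp simd aligned(a,aa,b:32)
    for (int i = 0; i < (1L << 10); i += 2) {
        a[i] = 2 * b[i];
        a[i+1] = aa[i+1];
    }
}
```

As before, this would be compiled with -fopenmp and called as test(a, a, b).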
