需要幫助優化代碼（最低圖像約定）

我寫了一些模擬代碼，並使用「隨機中斷GDB」調試方法。我發現我的計劃的99.9％的時間在這個例程（它是最小影像約定）花：需要幫助優化代碼（最低圖像約定）

inline double distanceSqPeriodic(double const * const position1, double const * const position2, double boxWidth) { 
     double xhw, yhw, zhw, x, y, z;                     

     xhw = boxWidth/2.0;                      
     yhw = xhw;                      
     zhw = xhw;                      

     x = position2[0] - position1[0];                    
     if (x > xhw) 
       x -= boxWidth;                      
     else if (x < -xhw) 
       x += boxWidth;                      


     y = position2[1] - position1[1];                    
     if (y > yhw) 
       y -= boxWidth;                      
     else if (y < -yhw) 
       y += boxWidth;                      


     z = position2[2] - position1[2];                    
     if (z > zhw) 
       z -= boxWidth;                      
     else if (z < -zhw) 
       z += boxWidth;                      

     return x * x + y * y + z * z;                     
}

至今（也許不是很顯著的）我已經執行的優化：

返回的距離，而不是平方根的平方
內聯它
常量我所能
沒有標準庫膨脹
編譯每g ++優化標誌我能想到

我用盡了我能做的事情。也許我可以使用花車而不是雙打，但我寧願這是最後的手段。也許我可以以某種方式在此上使用SIMD，但我從來沒有這樣做過，所以我想這是很多工作。有任何想法嗎？

感謝

來源

2013-02-25 Nick

這將有助於看到呼叫代碼，特別是如果你想要去SIMD路線。理解參數也會有幫助，例如每個呼叫的'boxWidths []'中的值是不同的？ – 2013-02-25 15:01:15

調用方法次數更少（即算法改進）。當然，不知道它是否適用;） – 2013-02-25 15:02:12

@Paul R No ... d'oh！讓我解決這個問題 – Nick 2013-02-25 15:02:26

首先，你沒有使用正確的算法。如果兩點大於boxWidth，會怎麼樣？其次，如果你有多個粒子，調用一個完成所有距離計算的單個函數，並將結果放到輸出緩衝區中將會顯着提高效率。內聯有助於減少一些，但不是全部。任何預先計算 - 例如在算法中將盒子長度除以2 - 將在不需要時重複。

這裏是一些SIMD代碼來做計算。您需要使用-msse4進行編譯。使用-O3，在我的機器上（macbook pro，llvm-gcc-4.2），我的速度提高了約2倍。這確實需要使用32位浮點數而不是雙精度算術。

上證所真的不是那麼複雜，它只是看起來可怕。例如而不是寫一個* b，你必須編寫笨重的_mm_mul_ps（a，b）。

#include <stdio.h> 
#include <stdlib.h> 
#include <time.h> 
#include <smmintrin.h> 

// you can compile this code with -DDOUBLE to try using doubles vs. floats 
// in the unoptimized code. The SSE code uses only floats. 
#ifdef DOUBLE 
typedef double real; 
#else 
typedef float real; 
#endif 

static inline __m128 loadFloat3(const float const* value) { 
    // Load (x,y,z) into a SSE register, leaving the last entry 
    // set to zero. 
    __m128 x = _mm_load_ss(&value[0]); 
    __m128 y = _mm_load_ss(&value[1]); 
    __m128 z = _mm_load_ss(&value[2]); 
    __m128 xy = _mm_movelh_ps(x, y); 
    return _mm_shuffle_ps(xy, z, _MM_SHUFFLE(2, 0, 2, 0)); 
} 

int fdistanceSqPeriodic(float* position1, float* position2, const float boxWidth, 
          float* out, const int n_points) { 
    int i; 
    __m128 r1, r2, r12, s12, r12_2, s, box, invBox; 

    box = _mm_set1_ps(boxWidth); 
    invBox = _mm_div_ps(_mm_set1_ps(1.0f), box); 

    for (i = 0; i < n_points; i++) { 
    r1 = loadFloat3(position1); 
    r2 = loadFloat3(position1); 
    r12 = _mm_sub_ps(r1, r2); 

    s12 = _mm_mul_ps(r12, invBox); 
    s12 = _mm_sub_ps(s12, _mm_round_ps(s12, _MM_FROUND_TO_NEAREST_INT)); 
    r12 = _mm_mul_ps(box, s12); 

    r12_2 = _mm_mul_ps(r12, r12); 

    // double horizontal add instruction accumulates the sum of 
    // all four elements into each of the elements 
    // (e.g. s.x = s.y = s.z = s.w = r12_2.x + r12_2.y + r12_2.z + r12_2.w) 
    s = _mm_hadd_ps(r12_2, r12_2); 
    s = _mm_hadd_ps(s, s); 

    _mm_store_ss(out++, s); 
    position1 += 3; 
    position2 += 3; 
    } 
    return 1; 
} 

inline real distanceSqPeriodic(real const * const position1, real const * const position2, real boxWidth) { 
    real xhw, yhw, zhw, x, y, z;                     

    xhw = boxWidth/2.0;                      
    yhw = xhw;                      
    zhw = xhw;                      

    x = position2[0] - position1[0];                    
    if (x > xhw) 
    x -= boxWidth;                      
    else if (x < -xhw) 
    x += boxWidth;                      


    y = position2[1] - position1[1];                    
    if (y > yhw) 
    y -= boxWidth;                      
    else if (y < -yhw) 
    y += boxWidth;                      


    z = position2[2] - position1[2];                    
    if (z > zhw) 
    z -= boxWidth;                      
    else if (z < -zhw) 
    z += boxWidth;                      

    return x * x + y * y + z * z;                     
} 


int main(void) { 
    real* position1; 
    real* position2; 
    real* output; 
    int n_runs = 10000000; 
    posix_memalign((void**) &position1, 16, n_runs*3*sizeof(real)); 
    posix_memalign((void**) &position2, 16, n_runs*3*sizeof(real)); 
    posix_memalign((void**) &output, 16, n_runs*sizeof(real)); 
    real boxWidth = 1.8; 
    real result = 0; 
    int i; 
    clock_t t; 

#ifdef OPT 
    printf("Timing optimized SSE implementation\n"); 
#else 
    printf("Timinig original implementation\n"); 
#endif 

#ifdef DOUBLE 
    printf("Using double precision\n"); 
#else 
    printf("Using single precision\n"); 
#endif 

    t = clock(); 
#ifdef OPT 
    fdistanceSqPeriodic(position1, position2, boxWidth, output, n_runs); 
#else 
    for (i = 0; i < n_runs; i++) { 
    *output = distanceSqPeriodic(position1, position2, boxWidth); 
    position1 += 3; 
    position2 += 3; 
    output++; 
    } 
#endif 
    t = clock() - t; 

    printf("It took me %d clicks (%f seconds).\n", (int) t, ((float)t)/CLOCKS_PER_SEC); 
}

來源

2013-08-05 12:03:52

嘿，這太棒了！感謝你這樣做。我一定會在我的模擬相關項目中使用它。 – Nick 2013-08-06 01:30:45

沒問題。我基本上是從我爲MDTraj（http://rmcgibbo.github.io/mdtraj/）寫的代碼複製粘貼的。我還沒有真正分析過這個時機 - 從玩這個代碼看起來最重要的是內存帶寬和對齊。一旦字節傳送到CPU或從CPU傳送出去，實際的計算本身就很便宜。 – 2013-08-06 05:20:37

正確的鏈接到mdtraj現在是https://github.com/mdtraj/mdtraj – max 2016-10-10 20:52:02

Return the square of the distance instead of the square root

說只要是個好主意，因爲你是比較方塊是正方形的。

Inline it

這有時是一個反優化：內聯的代碼佔用空間，在執行流水線/高速緩存，無論是支與否。

通常它沒有區別，因爲編譯器對是否內聯是否有最終的說法。

Const what I can

通常根本沒有差異。

No standard library bloat

臃腫？

Compiling with every g++ optimization flag I can think of

這很好：將大部分優化留給編譯器。只有當你衡量你真正的瓶頸，並確定這個瓶頸是否顯着時，才需要在手上進行優化。

你可以試試做的是讓你的代碼無分支。如果不使用位掩碼，這可能是這樣的：

//if (z > zhw) 
//  z -= boxWidths[2]; 
//else if (z < -zhw) 
//  z += boxWidths[2]; 

const auto z_a[] = { 
    z, 
    z - boxWidths[2] 
}; 
z = z_a[z>zhw]; 
...

或

z -= (z>zhw) * boxWidths[2];

然而，沒有保證，這是速度更快。您的編譯器現在可能難以在您的代碼中識別SIMD點，或者分支目標緩衝區做得很好，而且大部分時間您的函數都有相同的代碼路徑。

來源

2013-02-25 15:13:05

這實際上減慢了它的相當多。 – Nick 2013-02-25 15:25:14

您需要擺脫比較，因爲這些比較難以預測。

的功能被實現爲：

//   /\ /\ 
    //  /\/ \ 
    ----0----- or ------------ , as (-x)^2 == x^2 
    //
    //

後者是二腹肌語句的結果：

x = abs(half-abs(diff))+half;

代碼

double tst(double a[4], double b[4], double half) 
{ 
    double sum=0.0,t; 
    int i; 
    for (i=0;i<3;i++) { t=fabs(fabs(b[i]-a[i])-half)-half; sum+=t*t;} 
    return sum; 
}

由節拍原來的實現因子四（+一些） - 在這一點上甚至沒有完全平行：只有xmm註冊的下半部分rs被使用。

通過並行處理x & & y，理論上可以獲得約50％的收益。理論上使用花車而不是雙打可以使速度提高約3倍。

來源

2013-02-25 15:39:28

這些比較與算術表達式一樣難以預測。我認爲你的意思是if語句（即分支） – 2013-02-25 15:48:03

比較_implies_分支，就像在LUT [i> j]的另一種方法中那樣。 – 2013-02-25 18:38:14

例如在'z - =（z> zhw）* boxWidths [2];'沒有分支，而是比較。或者假設'const bool foobar = x 無分支（至少不在x86上） – 2013-02-26 11:27:51

您可能想要使用晶圓廠（在ISO 90 C中標準化），因爲這應該能夠被縮減爲單個非分支指令。

來源

2013-02-25 16:11:54 tletnes

需要幫助優化代碼（最低圖像約定）

回答

相關問題