Execution time increases when using multithreaded FFTW

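I am new to the FFTW library. I have successfully implemented 1D and 2D FFTs with it. I then converted my 2D FFT code into a multithreaded 2D FFT, but the result is the opposite of what I expected: the multithreaded 2D FFT code takes longer than the serial 2D FFT code. I must be missing something. I followed all the instructions given in the FFTW documentation for parallelizing the code.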

Here is my parallel 2D FFT C program:

#include <mpi.h> 
#include <fftw3.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <math.h> 
#include <time.h> 

#define N 2000 
#define M 2000 
#define index(i, j) (j + i*M) 

int i, j; 

void get_input(fftw_complex *in) { 
    for(i=0;i<N;i++){ 
     for(j=0;j<M;j++){ 
      in[index(i, j)][0] = sin(i + j); 
      in[index(i, j)][1] = sin(i * j); 
     } 
    } 
} 

void show_out(fftw_complex *out){ 
    for(i=0;i<N;i++){ 
     for(j=0;j<M;j++){ 
      printf("%lf %lf \n", out[index(i, j)][0], out[index(i, j)][1]); 
     } 
    } 
} 

int main(){ 
    clock_t start, end; 
    double time_taken; 
    start = clock(); 

    int a = fftw_init_threads(); 
    printf("%d\n", a); 
    fftw_complex *in, *out; 
    fftw_plan p; 

    in = (fftw_complex *)fftw_malloc(N * M * sizeof(fftw_complex)); 
    out = (fftw_complex *)fftw_malloc(N * M * sizeof(fftw_complex)); 
    get_input(in); 

    fftw_plan_with_nthreads(4); 
    p = fftw_plan_dft_2d(N, M, in, out, FFTW_FORWARD, FFTW_ESTIMATE); 

    fftw_execute(p); 

    /*p = fftw_plan_dft_1d(N, out, out, FFTW_BACKWARD, FFTW_ESTIMATE); 
    fftw_execute(p); 
    puts("In Real Domain"); 
    show_out(out);*/ 

    fftw_destroy_plan(p); 

    fftw_free(in); 
    fftw_free(out); 
    fftw_cleanup_threads(); 

    end = clock(); 
    time_taken = ((double) (end - start))/CLOCKS_PER_SEC; 
    printf("%g \n", time_taken); 

    return 0; 
}

Can anyone point out what I am doing wrong?


2017-08-30 Latish Pavan

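How many actual (not hyperthreaded) CPU cores do you have? – twalberg

@twalberg There are four. –

How long does the single-threaded run take compared with 4 threads? Have you tried running with just 2 threads? Because of thread-related overhead, the speedup as a function of thread count degrades once there are too many threads. – atru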

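This behavior is typical of incorrect binding.

In general, OpenMP threads should all be bound to cores of the same socket in order to avoid NUMA effects (which can make performance suboptimal, or even worse).

Also, make sure the MPI tasks are bound correctly (one task should be bound to several cores from the same socket, and you should use one OpenMP thread per core).

Because of MPI, your OpenMP threads can end up time-sharing cores.

First, I suggest you start by printing both the MPI and the OpenMP bindings.

How to do that depends on your MPI library and your OpenMP runtime. If you use Open MPI and the Intel compilers, you can run KMP_AFFINITY=verbose mpirun --report-bindings --tag-output ...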

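Then, as mentioned earlier, I suggest you start simple and add complexity one step at a time:

1. 1 MPI task and 1 OpenMP thread
2. 1 MPI task and x OpenMP threads (x is the number of cores on one socket)
3. x MPI tasks and 1 OpenMP thread per task
4. x MPI tasks and y OpenMP threads per task

Hopefully, 2. will be faster than 1., and 4. will be faster than 3.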
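
When comparing these configurations, it may also help to check how the run is being timed. In C, clock() reports processor time used by the program, and on Linux/glibc that is CPU time summed over all threads of the process, so it can go up with the thread count even when the elapsed time goes down. Below is a minimal sketch, assuming a POSIX system that provides clock_gettime with CLOCK_MONOTONIC. The names wall_seconds, cpu_seconds, time_region and busy_work are hypothetical helpers introduced here for illustration; they are not part of FFTW or of the code in the question.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* assumption: POSIX clock_gettime is available */
#endif
#include <stdio.h>
#include <time.h>

/* Elapsed (wall-clock) seconds read from a monotonic clock. */
static double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Processor time used by the program so far, in seconds. */
static double cpu_seconds(void) {
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Hypothetical callback type for the region being measured. */
typedef void (*work_fn)(void *arg);

/* Runs fn(arg) once, stores elapsed and CPU seconds, and prints both. */
static void time_region(const char *label, work_fn fn, void *arg,
                        double *wall_out, double *cpu_out) {
    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    fn(arg);
    *wall_out = wall_seconds() - w0;
    *cpu_out = cpu_seconds() - c0;
    printf("%s: wall %.6f s, cpu %.6f s\n", label, *wall_out, *cpu_out);
}

/* Placeholder workload so the sketch can be tried without FFTW. */
static void busy_work(void *arg) {
    volatile double acc = 0.0;
    long n = *(long *)arg;
    for (long k = 0; k < n; k++) {
        acc += (double)k * 0.5;
    }
}
```

To time the transform from the question, one could pass a small wrapper that calls fftw_execute on the plan (for example static void run_plan(void *p) { fftw_execute((fftw_plan)p); }, again a hypothetical name) and compare the wall-clock figure obtained with fftw_plan_with_nthreads(1) against the one obtained with fftw_plan_with_nthreads(4). Timing only the fftw_execute call keeps input generation and planning out of the measurement.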


2017-08-31 00:28:28
