Why is multithreading slower than sequential programming in my case?

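I am new to multithreading and am trying to learn it through a simple program that adds the numbers from 1 to n and returns the sum. In the sequential case, main calls the sumFrom1 function twice, for n = 1e5 and 2e5; in the multithreaded case, two threads are created with pthread_create and the two sums are computed in separate threads. The multithreaded version is much slower than the sequential version (see the results below). I run this on a 12-CPU platform, and there is no communication between the threads. Why is multithreading slower than sequential programming in my case?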

Multithreaded:

Thread 1 returns: 0 
Thread 2 returns: 0 
sum of 1..10000: 50005000 
sum of 1..20000: 200010000 
time: 156 seconds

Sequential:

sum of 1..10000: 50005000 
sum of 1..20000: 200010000 
time: 56 seconds

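When I add -O2 at compile time, the multithreaded version (about 8 s) takes less time than the sequential version (about 11 s), but not by as much as I expected. I can always use the -O2 flag, but I am curious about why multithreading is so slow in the unoptimized case. Should it be slower than the sequential version? If not, what can I do to make it faster?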

Code:

#include <stdio.h> 
#include <pthread.h> 
#include <time.h> 

typedef struct my_struct 
{ 
    int n;                                        
    int sum;                                        
}my_struct_t;                                       

void *sumFrom1(void* sit)                                    
{                                          
    my_struct_t* local_sit = (my_struct_t*) sit;                               
    int i;                                        
    int nsim = 500000; // Loops for consuming time                                     
    int j;                                        

    for(j = 0; j < nsim; j++)
    {
        local_sit->sum = 0;
        for(i = 0; i <= local_sit->n; i++)
            local_sit->sum += i;
    }
} 

int main(int argc, char *argv[])                                  
{                                          
    pthread_t thread1;                                    
    pthread_t thread2;                                    
    my_struct_t si1;                                     
    my_struct_t si2;                                     
    int   iret1;                                     
    int   iret2;                                     
    time_t  t1;                                      
    time_t  t2;                                      


    si1.n = 10000;                                      
    si2.n = 20000;                                      

    if(argc == 2 && atoi(argv[1]) == 1) // Use "./prog 1" to test the time of multithreaded version                                 
    {                                         
        t1 = time(0);
        iret1 = pthread_create(&thread1, NULL, sumFrom1, (void*)&si1);
        iret2 = pthread_create(&thread2, NULL, sumFrom1, (void*)&si2);
        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);
        t2 = time(0);

        printf("Thread 1 returns: %d\n",iret1);
        printf("Thread 2 returns: %d\n",iret2);
        printf("sum of 1..%d: %d\n", si1.n, si1.sum);
        printf("sum of 1..%d: %d\n", si2.n, si2.sum);
        printf("time: %d seconds", t2 - t1);

    }                                         
    else  // Use "./prog" to test the time of sequential version                                       
    {                                         
        t1 = time(0);
        sumFrom1((void*)&si1);
        sumFrom1((void*)&si2);
        t2 = time(0);

        printf("sum of 1..%d: %d\n", si1.n, si1.sum);
        printf("sum of 1..%d: %d\n", si2.n, si2.sum);
        printf("time: %d seconds", t2 - t1);
    }                        
    return 0;                       
}

UPDATE 1:

After a bit of googling on "false sharing" (thanks, @Martin James!), I think it is the main cause. There are (at least) two ways to fix it:

The first way is to insert a buffer zone between the two structures (thanks, @dasblinkenlight):

my_struct_t si1; 
char   memHolder[4096]; 
my_struct_t si2;

Without -O2, the elapsed time drops from ~156 s to ~38 s.
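The C standard does not say how separate local variables are laid out relative to each other, so whether memHolder really ends up between si1 and si2 depends on the compiler. A minimal sketch of a more explicit alternative is to request cache-line alignment with C11 alignas. The 64-byte line size, the CACHE_LINE macro, the padded_struct_t wrapper and the on_different_lines helper are assumptions made for illustration, not part of the original program.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. This is common but not guaranteed;
   check the target CPU before relying on it. */
#define CACHE_LINE 64

typedef struct my_struct
{
    int n;
    int sum;
} my_struct_t;

/* Hypothetical wrapper: aligning the member to CACHE_LINE raises the
   alignment of the whole struct, and its size is rounded up to a
   multiple of CACHE_LINE, so two wrappers never share a cache line. */
typedef struct padded_struct
{
    alignas(CACHE_LINE) my_struct_t data;
} padded_struct_t;

/* Returns 1 if the two addresses fall into different cache lines. */
int on_different_lines(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE != (uintptr_t)b / CACHE_LINE;
}
```

With this wrapper, si1 and si2 would be declared as padded_struct_t and the threads would receive &si1.data and &si2.data. The cost is 64 bytes per structure instead of 4096 bytes of filler.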

The second way is to avoid updating sit->sum so frequently in sumFrom1, which can be done with a temporary variable (as @Jens Gustedt answered):

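int sum = 0; // local accumulator, declared before the loop so it is still in scope for the final store
for(j = 0; j < nsim; j++)
{
    sum = 0;
    for(i = 0; i <= local_sit->n; i++)
        sum += i;
}
local_sit->sum = sum; // single write to the shared structure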

Without -O2, the elapsed time drops from ~156 s to either ~35 s or ~109 s (it has two peaks, and I don't know why). With -O2, the elapsed time stays at ~8 s.
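Since the timings above vary from run to run and time(0) typically counts whole seconds only, it may help to time several runs with a finer clock and look at the mean. The following is a rough sketch using C11 timespec_get. The helper names now_seconds, mean_seconds and sum_once are hypothetical, and sum_once performs a single pass over 1..n rather than the 500000 repetitions of the original.

```c
#include <time.h>

typedef struct my_struct
{
    int n;
    int sum;
} my_struct_t;

/* Wall-clock time in seconds with sub-second resolution (C11). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical work function: one pass over 1..n with a local
   accumulator and a single store to the shared structure. */
void sum_once(void *arg)
{
    my_struct_t *s = arg;
    int sum = 0;
    for (int i = 0; i <= s->n; i++)
        sum += i;
    s->sum = sum;
}

/* Hypothetical helper: calls work(arg) `runs` times and returns the
   mean elapsed wall-clock time per call, in seconds. */
double mean_seconds(void (*work)(void *), void *arg, int runs)
{
    double total = 0.0;
    for (int r = 0; r < runs; r++)
    {
        double start = now_seconds();
        work(arg);
        total += now_seconds() - start;
    }
    return runs > 0 ? total / runs : 0.0;
}
```

A call such as mean_seconds(sum_once, &si1, 10) gives the average over ten runs. Recording the individual run times as well would show whether they cluster around two values.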

2012-04-11 cogitovita

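In tests like this, the results need to be averaged. How many times did you run the test with -O2 optimization? If you ran it several times, what was the average time? – 2012-04-11 09:23:53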

si1 and si2 are adjacent to each other. False sharing? – 2012-04-11 09:32:56

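@PavanManjunath Thanks for the suggestion. I ran it 10 times with -O2. The average time was 7.9 s for the multithreaded version and 11.7 s for the sequential version. The fluctuation was small. – cogitovita 2012-04-11 09:37:34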

By modifying your code to

typedef struct my_struct 
{ 
    size_t n; 
    size_t sum; 
}my_struct_t; 

void *sumFrom1(void* sit) 
{ 
    my_struct_t* local_sit = sit; 
    size_t nsim = 500000; // Loops for consuming time 
    size_t n = local_sit->n; 
    size_t sum = 0; 
    for(size_t j = 0; j < nsim; j++)
    {
        for(size_t i = 0; i <= n; i++)
            sum += i;
    }
    local_sit->sum = sum; 
    return 0; 
}

the phenomenon disappears. The problems you had:

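Using int as the data type is completely wrong for such a test. Your numbers were such that the sum overflowed. Overflow of signed types is undefined behavior. You are lucky that it didn't eat your lunch.
Having the bounds and the summation variable behind an indirection buys you additional loads and stores, which at -O0 are really performed as written, with all the implications of false sharing and the like.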
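To make the overflow point concrete, here is a small sketch that checks whether the sum 1..n fits into an int. It assumes a 32-bit int, and the helper names sum_1_to_n and sum_fits_in_int are hypothetical. Under that assumption, the totals for n = 10000 and n = 20000 (the values in the posted code) fit, while those for n = 1e5 and 2e5 (the values named in the question text) do not.

```c
#include <limits.h>

/* Closed form n(n+1)/2, computed in long long, which is wide enough
   for the values of n discussed here. */
long long sum_1_to_n(long long n)
{
    return n * (n + 1) / 2;
}

/* Returns 1 if the sum 1..n can be stored in an int on this platform. */
int sum_fits_in_int(long long n)
{
    return sum_1_to_n(n) <= INT_MAX;
}
```

The modified sumFrom1 above does not reset sum between outer iterations, so its total is nsim times larger still. That is one more reason to use a wide unsigned type such as size_t there.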

Your code also showed other errors:

a missing include for atoi
superfluous casts to and from void*
printing a time_t as an int

Please compile your code with -Wall before posting.
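As a sketch of how these three points might be addressed: include stdlib.h for atoi, rely on the implicit conversion between void* and object pointers, and print the elapsed time through difftime, which returns a double. The helper names want_threads, get_n and print_elapsed are made up for illustration and do not appear in the original program.

```c
#include <stdio.h>
#include <stdlib.h> /* atoi */
#include <time.h>   /* time_t, difftime */

typedef struct my_struct
{
    int n;
    int sum;
} my_struct_t;

/* Hypothetical helper: returns 1 if the program was started as "./prog 1". */
int want_threads(int argc, char *argv[])
{
    return argc == 2 && atoi(argv[1]) == 1;
}

/* In C, void* converts to and from object pointers implicitly, so no
   cast is needed here or when passing &si1 to pthread_create. */
int get_n(void *sit)
{
    my_struct_t *local_sit = sit;
    return local_sit->n;
}

/* difftime returns the difference in seconds as a double, so the
   format string does not depend on how time_t is represented. */
void print_elapsed(time_t start, time_t end)
{
    printf("time: %.0f seconds\n", difftime(end, start));
}
```

Compiling the original program with -Wall should report the missing atoi declaration and the mismatch between %d and the time_t argument.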

2012-04-11 10:26:33

Using 'size_t sum = 0;' gives a significant performance gain, but then adding 'size_t n = local_sit->n;' slows it down again. Any idea why? (All compiled with -O0.) – alk 2012-04-11 11:04:40

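No, not really. I don't think discussing non-optimized code at this level of detail makes much sense. If you want to know what is really going on, the first step is to look at the assembler generated with '-S'. – 2012-04-11 11:18:24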
