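TBB lambda vs self-written body object
I am experimenting with the code below to compare the performance of a serial for loop against parallel_for, in both its non-lambda and lambda forms.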
    #include <iostream>
    #include <chrono>
    #include <ctime>
    #include <fstream>
    #include <stdlib.h>
    #define MAX 10000000
    #include "tbb/tbb.h"
    #include "tbb/task_scheduler_init.h"
    using namespace std;
    using namespace tbb;

    // Takes its argument by value, so the squared result is discarded.
    void squarecalc(int a)
    {
        a *= a;
    }

    void serial_apply_square(int* a)
    {
        for (int i = 0; i < MAX; i++)
            squarecalc(*(a + i));
    }

    // Self-written body object for parallel_for.
    class apply_square
    {
        int* my_a;
    public:
        void operator()(const blocked_range<size_t>& r) const
        {
            int* a = my_a;
            for (size_t i = r.begin(); i != r.end(); ++i)
                squarecalc(a[i]);
        }
        apply_square(int* a) : my_a(a) {}
    };

    void parallel_apply_square(int* a, size_t n)
    {
        parallel_for(blocked_range<size_t>(0, n), apply_square(a));
    }

    // Same loop, with the body written as a lambda.
    void parallel_apply_square_lambda(int* a, size_t n)
    {
        parallel_for(blocked_range<size_t>(0, n), [=](const blocked_range<size_t>& r)
        {
            for (size_t i = r.begin(); i != r.end(); ++i)
                squarecalc(a[i]);
        });
    }

    int main()
    {
        std::chrono::time_point<std::chrono::system_clock> start, end;
        int i = 0;
        int* a = new int[MAX];

        // Read the input numbers from "newfile".
        fstream of;
        of.open("newfile", ios::in);
        while (i < MAX)
        {
            of >> a[i];
            i++;
        }

        start = std::chrono::system_clock::now();
        serial_apply_square(a);
        end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end - start;
        cout << "\nTime for serial execution :" << elapsed_seconds.count() << endl;

        start = std::chrono::system_clock::now();
        parallel_apply_square(a, MAX);
        end = std::chrono::system_clock::now();
        elapsed_seconds = end - start;
        cout << "\nTime for parallel execution [without lambda] :" << elapsed_seconds.count() << endl;

        start = std::chrono::system_clock::now();
        parallel_apply_square_lambda(a, MAX);
        end = std::chrono::system_clock::now();
        elapsed_seconds = end - start;
        cout << "\nTime for parallel execution [with lambda] :" << elapsed_seconds.count() << endl;

        delete[] a;  // allocated with new[], so release with delete[] rather than free()
    }
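In short, it just computes the squares of ten million numbers, once serially and twice in parallel. Below is the output I got from several executions of the compiled program.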
**1st execution**
Time for serial execution :0.043183
Time for parallel execution [without lambda] :0.035238
Time for parallel execution [with lambda] :0.036719
**2nd execution**
Time for serial execution :0.043252
Time for parallel execution [without lambda] :0.035403
Time for parallel execution [with lambda] :0.036811
**3rd execution**
Time for serial execution :0.043241
Time for parallel execution [without lambda] :0.035355
Time for parallel execution [with lambda] :0.036558
**4th execution**
Time for serial execution :0.043216
Time for parallel execution [without lambda] :0.035491
Time for parallel execution [with lambda] :0.036697
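Given that the parallel execution time is lower than the serial execution time in every run, I am curious why the lambda version takes longer than the other parallel version, where the body object is hand-written.
- Why does the lambda version always take more time?
- Is it because of compiler overhead in creating its own body object?
- If the answer to the above is yes, is the lambda version then inferior to the self-written version?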
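For context on the second question: the C++ language defines a lambda expression as an object of an unnamed closure class with its own `operator()`, so it is structurally the same kind of thing as a hand-written body object. The sketch below illustrates this without TBB. `apply_in_chunks` is a hypothetical serial stand-in for `parallel_for`, and, unlike the code in the question, the bodies square in place so that the results can be compared.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-written body object: stores the pointer and exposes operator().
class SquareBody {
    int* data_;
public:
    explicit SquareBody(int* data) : data_(data) {}
    void operator()(std::size_t begin, std::size_t end) const {
        for (std::size_t i = begin; i != end; ++i)
            data_[i] *= data_[i];
    }
};

// Hypothetical stand-in for parallel_for (not a TBB API):
// calls the body serially on consecutive chunks of [0, n).
template <typename Body>
void apply_in_chunks(std::size_t n, std::size_t chunk, const Body& body) {
    assert(chunk > 0);
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        std::size_t end = (begin + chunk < n) ? begin + chunk : n;
        body(begin, end);
    }
}

// Version using the hand-written body object.
std::vector<int> square_with_functor(std::vector<int> v) {
    apply_in_chunks(v.size(), 4, SquareBody(v.data()));
    return v;
}

// Version using a lambda; the compiler generates a closure class that
// holds a copy of `data`, much like SquareBody holds `data_`.
std::vector<int> square_with_lambda(std::vector<int> v) {
    int* data = v.data();
    apply_in_chunks(v.size(), 4, [=](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i != end; ++i)
            data[i] *= data[i];
    });
    return v;
}
```

Both functions go through the same template and produce the same output, so any timing difference between the two forms would come from how a particular compiler and optimization level handle each one, not from the lambda being a different mechanism.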
Edit
Below are the results for the optimized code (level -O2):
**1st execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.00055
Time for parallel execution [with lambda] :1e-05
**2nd execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.000583
Time for parallel execution [with lambda] :9e-06
**3rd execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.000554
Time for parallel execution [with lambda] :9e-06
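With optimization, the serial part now shows the best result, and the time for the lambda part has improved as well.
Does this mean that the performance of parallel code always needs to be tested with optimized code?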
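One thing that may affect the numbers above: the non-lambda version is always the first parallel call in the program, so its time may include one-time start-up cost of the task scheduler and its worker threads. A common way to reduce that kind of ordering effect is to do an untimed warm-up call and then keep the best of several timed runs. The helper below is a minimal sketch of that idea; `best_time_seconds` is a hypothetical name, not part of TBB, and it uses `std::chrono::steady_clock`, which is the monotonic clock intended for measuring intervals.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: calls `work` once untimed as a warm-up, then
// returns the shortest of `repeats` timed calls, in seconds.
template <typename Work>
double best_time_seconds(Work&& work, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;

    work();  // warm-up: absorbs one-time initialization cost

    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        auto start = clock::now();
        work();
        std::chrono::duration<double> elapsed = clock::now() - start;
        if (elapsed.count() < best)
            best = elapsed.count();
    }
    return best;
}
```

With the code from the question, it could be used as `best_time_seconds([&] { parallel_apply_square(a, MAX); }, 5)` and likewise for the lambda version, so that both are measured under the same conditions.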
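What compiler and what optimization level do you use? You do realize that any decent compiler should replace the call to `squarecalc` with a no-op, because you pass the parameter by value rather than by reference? – MikeMB 2015-04-04 07:48:41
squarecalc is intentionally designed to take its parameter by value, because my main goal is to measure the time and I am least interested in getting the squares. I am using g++ version 4.6.3 and have applied no optimization. – sjsam 2015-04-04 08:08:13
Sorry, but in that case I can't help you. Arguing about the performance of unoptimized code is pointless. As I said: if the code were optimized, you would probably only measure overhead (`squarecalc`, for example, would never be called). – MikeMB 2015-04-04 08:24:12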
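To make the point from these comments concrete, here is a minimal sketch of a variant whose work an optimizer is not free to drop: the value is passed by reference so the square is actually stored, and a checksum of the array is computed afterwards so the stored results are used. The names `squarecalc_ref`, `serial_apply_square_ref` and `checksum` are hypothetical and not part of the original code, and the sketch assumes the inputs are small enough that squaring them does not overflow `int`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant of squarecalc that takes its argument by reference,
// so the squared value is written back to the array.
inline void squarecalc_ref(int& a) {
    a *= a;
}

// Serial loop over n elements using the by-reference variant.
void serial_apply_square_ref(int* a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        squarecalc_ref(a[i]);
}

// Sums the array; printing or otherwise using this value after the
// timed region makes the stored squares observable.
long long checksum(const int* a, std::size_t n) {
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

Printing the checksum after each timed section gives the compiler a reason to keep the loop, and it also confirms that the serial and parallel versions computed the same thing.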