爲什麼不同整數大小的數組具有不同的性能？

的寫入時間到std::array爲int8，int16，int32和int64與每個大小增加一倍。我可以理解8位CPU的這種行爲，但不是32/64位。

爲什麼32位系統需要比保存8位值多4倍的時間來保存32位值？

這裏是我的測試代碼：

#include <iostream> 
#include <array> 
#include <chrono> 

std::array<std::int8_t, 64 * 1024 * 1024> int8Array; 
std::array<std::int16_t, 64 * 1024 * 1024> int16Array; 
std::array<std::int32_t, 64 * 1024 * 1024> int32Array; 
std::array<std::int64_t, 64 * 1024 * 1024> int64Array; 

void PutZero() 
{ 
    auto point1 = std::chrono::high_resolution_clock::now(); 
    for (auto &v : int8Array) v = 0; 
    auto point2 = std::chrono::high_resolution_clock::now(); 
    for (auto &v : int16Array) v = 0; 
    auto point3 = std::chrono::high_resolution_clock::now(); 
    for (auto &v : int32Array) v = 0; 
    auto point4 = std::chrono::high_resolution_clock::now(); 
    for (auto &v : int64Array) v = 0; 
    auto point5 = std::chrono::high_resolution_clock::now(); 
    std::cout << "Time of processing int8 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point2 - point1)).count() << "us." << std::endl; 
    std::cout << "Time of processing int16 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point3 - point2)).count() << "us." << std::endl; 
    std::cout << "Time of processing int32 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point4 - point3)).count() << "us." << std::endl; 
    std::cout << "Time of processing int64 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point5 - point4)).count() << "us." << std::endl; 
} 

int main() 
{ 
    PutZero(); 
    std::cout << std::endl << "Press enter to exit" << std::endl; 
    std::cin.get(); 
    return 0; 
}

我編譯它在linux下有：g++ -o array_issue_1 main.cpp -O3 -std=c++14

和我的結果如下：

Time of processing int8 array: 9922us. 
Time of processing int16 array: 37717us. 
Time of processing int32 array: 76064us. 
Time of processing int64 array: 146803us.

如果我-O2，那麼結果編譯是int8差5倍！

您也可以在Windows中編譯此源代碼。你會得到類似的結果之間的關係。

更新＃1

當我和-02編譯，然後我的結果如下：

Time of processing int8 array: 60182us. 
Time of processing int16 array: 77807us. 
Time of processing int32 array: 114204us. 
Time of processing int64 array: 186664us.

我沒有分析彙編輸出。我的主要觀點是，我想用C++編寫高效的代碼和諸如此類的東西，從性能角度來看，像std::array這樣的東西可能具有挑戰性，並且以某種方式違反直覺。

來源

2017-10-18 wrealcon

你看過裝配輸出了嗎？ –

我[寫了一個基準測試]（https://github.com/travisdowns/uarch-bench），其中包括測試：「BYTE」，「WORD」，「DWORD」和「QWORD」的代價。其結果是，它們都在現代英特爾CPU上完全採用1個週期，除非它們跨越高速緩存線邊界，此時它們需要2個週期。請注意，如果您在字節顆粒位置隨機寫入，則較大的值更可能跨越緩存行邊界。實際上，這個基準和大多數代碼將使用對齊的緩衝區，所以根本不會出現跨越邊界。 – BeeOnRope

你的「主要觀點」是亂碼，你可以把它編輯清楚嗎？ PS除非在算法/庫和C++抽象機器及其對象和數據類型的層次上，否則不應該擔心在C++中「高效」，直到遇到並找到特定的瓶頸爲止，此時您仍應該尋求解決方案算法/數據類型/對象級別，然後是分配器，所有這些仍然使用C++，如果你曾經關心過機器代碼，那麼你應該解決機器代碼而不是C++中的問題。 – philipxy

爲什麼32位系統需要比保存8位值多4倍的時間來保存32位值？

它沒有。但是你的基準測試有3個不同的問題會給你帶來這些結果。

你不是在錯誤的內存。因此，您在基準測試期間會對陣列進行頁面錯誤檢測。這些頁面錯誤以及OS內核交互是當時的主要因素。
帶有-O3的編譯器通過將所有循環轉換爲memset()完全擊敗了您的基準測試。
你的基準是內存限制的。所以你正在測量內存的速度而不是核心。

問題1：測試數據不Prefaulted

你的數組的聲明，但基準之前沒有使用。由於內核和內存分配的工作方式，它們尚未映射到內存中。只有當你第一次碰到他們時，纔會發生這種情況。而當它發生時，內核映射頁面會導致非常大的損失。

這可以通過在基準測試之前觸摸所有陣列來完成。

沒有預斷裂作用：http://coliru.stacked-crooked.com/a/1df1f3f9de420d18

g++ -O3 -Wall main.cpp && ./a.out 
Time of processing int8 array: 28983us. 
Time of processing int16 array: 57100us. 
Time of processing int32 array: 113361us. 
Time of processing int64 array: 224451us.

帶預斷裂作用：http://coliru.stacked-crooked.com/a/7e62b9c7ca19c128

g++ -O3 -Wall main.cpp && ./a.out 
Time of processing int8 array: 6216us. 
Time of processing int16 array: 12472us. 
Time of processing int32 array: 24961us. 
Time of processing int64 array: 49886us.

的時間由4換句話說粗略的因素下降，原來的基準測試測量更多的內核比實際的代碼。

問題2：編譯器打敗基準

編譯器識別您的書寫零的模式與通話是完全更換所有的循環，以memset()。因此，實際上，您使用不同的尺寸來測量對memset()的呼叫。

call std::chrono::_V2::system_clock::now() 
    xor esi, esi 
    mov edx, 67108864 
    mov edi, OFFSET FLAT:int8Array 
    mov r14, rax 
    call memset 
    call std::chrono::_V2::system_clock::now() 
    xor esi, esi 
    mov edx, 134217728 
    mov edi, OFFSET FLAT:int16Array 
    mov r13, rax 
    call memset 
    call std::chrono::_V2::system_clock::now() 
    xor esi, esi 
    mov edx, 268435456 
    mov edi, OFFSET FLAT:int32Array 
    mov r12, rax 
    call memset 
    call std::chrono::_V2::system_clock::now() 
    xor esi, esi 
    mov edx, 536870912 
    mov edi, OFFSET FLAT:int64Array 
    mov rbp, rax 
    call memset 
    call std::chrono::_V2::system_clock::now()

這樣做的優化是-ftree-loop-distribute-patterns。即使你關閉它，矢量化器也會給你類似的效果。

隨着-O2，向量化和模式識別都被禁用。所以編譯器會給你你寫的東西。

.L4: 
    mov BYTE PTR [rax], 0   ;; <<------ 1 byte at a time 
    add rax, 1 
    cmp rdx, rax 
    jne .L4 
    call std::chrono::_V2::system_clock::now() 
    mov rbp, rax 
    mov eax, OFFSET FLAT:int16Array 
    lea rdx, [rax+134217728] 
.L5: 
    xor ecx, ecx 
    add rax, 2 
    mov WORD PTR [rax-2], cx  ;; <<------ 2 bytes at a time 
    cmp rdx, rax 
    jne .L5 
    call std::chrono::_V2::system_clock::now() 
    mov r12, rax 
    mov eax, OFFSET FLAT:int32Array 
    lea rdx, [rax+268435456] 
.L6: 
    mov DWORD PTR [rax], 0  ;; <<------ 4 bytes at a time 
    add rax, 4 
    cmp rax, rdx 
    jne .L6 
    call std::chrono::_V2::system_clock::now() 
    mov r13, rax 
    mov eax, OFFSET FLAT:int64Array 
    lea rdx, [rax+536870912] 
.L7: 
    mov QWORD PTR [rax], 0  ;; <<------ 8 bytes at a time 
    add rax, 8 
    cmp rdx, rax 
    jne .L7 
    call std::chrono::_V2::system_clock::now()

隨着-O2：http://coliru.stacked-crooked.com/a/edfdfaaf7ec2882e

g++ -O2 -Wall main.cpp && ./a.out 
Time of processing int8 array: 28414us. 
Time of processing int16 array: 22617us. 
Time of processing int32 array: 32551us. 
Time of processing int64 array: 56591us.

現在很明顯，較小的字長速度較慢。但是，如果所有字的大小都是相同的速度，你會期望時間是平坦的。他們不是因爲內存帶寬。

問題3：內存帶寬

由於基準（如寫入）僅寫入零，很容易飽和的芯/系統存儲器帶寬。所以基準會受到多少內存被觸動的影響。

爲了解決這個問題，我們需要縮小數據集以適應緩存。爲了彌補這一點，我們多次循環相同的數據。

std::array<std::int8_t, 512> int8Array; 
std::array<std::int16_t, 512> int16Array; 
std::array<std::int32_t, 512> int32Array; 
std::array<std::int64_t, 512> int64Array; 

... 

auto point1 = std::chrono::high_resolution_clock::now(); 
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int8Array) v = 0; 
auto point2 = std::chrono::high_resolution_clock::now(); 
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int16Array) v = 0; 
auto point3 = std::chrono::high_resolution_clock::now(); 
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int32Array) v = 0; 
auto point4 = std::chrono::high_resolution_clock::now(); 
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int64Array) v = 0; 
auto point5 = std::chrono::high_resolution_clock::now();

現在我們看到很多更扁平的不同字大小時序：

http://coliru.stacked-crooked.com/a/f534f98f6d840c5c

g++ -O2 -Wall main.cpp && ./a.out 
Time of processing int8 array: 20487us. 
Time of processing int16 array: 21965us. 
Time of processing int32 array: 32569us. 
Time of processing int64 array: 26059us.

之所以說它不是完全平坦的，可能是因爲有涉及編譯器優化的許多其他因素。您可能需要使用循環展開來拉近距離。

來源

2017-10-18 18:32:08 Mysticial

這很有幫助，但問題是爲什麼'mov QWORD'花費兩倍於mov DWORD？ – wally

@rex它沒有。 OP認爲情況是這樣的唯一原因是因爲編譯器打破了基準。我想我會澄清這一點。 – Mysticial

好的，所以'-O2'你看到了不同字號的相似時間嗎？我正在嘗試複製，並針對不同的字號查看[差異]（http://coliru.stacked-crooked.com/a/2b9e3e32649afcb5）。即使使用'-O2'。我也用MSVC得到了不同（每次是以前的兩倍）。 asm指令以25344，47447，89533和178087的次數報告爲「rep stos byte」，「rep stos word」，「rep stos dword」和「rep stos qword」。所以我想了解答案可能是這不是真的發生。 – wally

爲什麼不同整數大小的數組具有不同的性能？

回答

相關問題