_umul128

在Visual C++中，當針對Windows 32位時_umul128未定義。針對Win32時，兩個無符號64位整數如何相乘？該解決方案只需在針對Windows 32位的Visual C++ 2017上工作。_umul128

2017-10-22 Augusto

https://msdn.microsoft.com/en-us/library/windows/desktop/aa383711(v=vs.85).aspx –

你需要做很多嗎？這可能值得使用SSE2或AVX2，具體取決於你想要做什麼。雖然可能不會，因爲沒有SIMD附加功能。但是，請參閱https://stackoverflow.com/questions/17863411/sse-multiplication-of-2-64-bit-integers獲取64位結果。不過，這需要更多的指令來獲得128b的結果。 https://stackoverflow.com/questions/6738283/can-xmm-registers-be-used-to-do-any-128-bit-integer-math –

[SIMD使用無符號乘法對64位* 64位進行簽名到128位]（https://stackoverflow.com/questions/28807341/simd-signed-with-unsigned-multiplication-for-64-bit-64-bit-to-128-bit），可能加速32位，如果你有AVX2或AVX512，使用'adc' +'mul'（或BMI2'mulx'）作爲大數組數組的構建塊，但是一次一個標量可能會更好。 –

我發現下面的代碼（從xmrrig），這似乎做的工作就好了：

static inline uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{ 
    // multiplier = ab = a * 2^32 + b 
    // multiplicand = cd = c * 2^32 + d 
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d 
    uint64_t a = multiplier >> 32; 
    uint64_t b = multiplier & 0xFFFFFFFF; 
    uint64_t c = multiplicand >> 32; 
    uint64_t d = multiplicand & 0xFFFFFFFF; 

    //uint64_t ac = a * c; 
    uint64_t ad = a * d; 
    //uint64_t bc = b * c; 
    uint64_t bd = b * d; 

    uint64_t adbc = ad + (b * c); 
    uint64_t adbc_carry = adbc < ad ? 1 : 0; 

    // multiplier * multiplicand = product_hi * 2^64 + product_lo 
    uint64_t product_lo = bd + (adbc << 32); 
    uint64_t product_lo_carry = product_lo < bd ? 1 : 0; 
    *product_hi = (a * c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry; 

    return product_lo; 
}

來源

2017-10-25 02:43:29 Augusto

不幸的是，這個編譯MSVC真的很差。對於uint64 * uint64 => 64位乘法，它使用'__allmul'的函數調用，即使兩個輸入的上半部分都是編譯時0.也就是說，當它可以使用一個'mul'指令來執行32x32 => 64b全乘。 https://godbolt.org/g/53Qxyo。如果32x32 => 64b乘法的內在原因，它會使效率更高。鏗鏘很好地編譯它;還有很多工作要做，但它對所有隨身攜帶使用'adc'指令。 MSVC使用一些'adc'，但爲其他分支：/ –

此外，請確保*不要*在內部可用的64位代碼中使用它。（我猜如果編譯器提供它，使用相同的名稱是確保編譯時錯誤的好方法）。它以64位模式編譯，但不是一個'mul'指令。 –

如果你需要這個在32位代碼中實際上很快，可以考慮使用http://gmplib.org/。他們有一個32位的asm實現乘法，但是我沒有看到2個肢體* 2個肢體= 4個肢體的特殊情況。當你通過這些尺寸時，IDK通用函數的速度有多快。 https://gmplib.org/repo/gmp/file/tip/mpn/x86/mul_basecase.asm –

這個答案已經從MSVC 32位模式優化了對方的回答某個版本的xmrrig function的。原始版本與其他編譯器，特別是鏗鏘有關。

我看着@ Augusto的功能MSVC的輸出，它真的很糟糕。 對於32x32 => 64b乘法使用__emulu明顯改善了它（因爲MSVC是愚蠢的，並且在已知輸入實際上僅爲32位且上半部分爲零的情況下不優化uint64_t * uint64_t = uint64_t）。其他編譯器（gcc和clang）生成一個單一的mul指令，而不是調用輔助函數。 MSVC的代碼還有其他問題，我不知道如何通過調整源代碼來修復，儘管。我認爲，如果你想在該編譯器上獲得良好的性能，那麼必須使用內聯asm（或單獨編譯的asm函數）。

如果您需要更靈活的任意精度（更大的數字），請參閱GMPlib's low-level functions，它們有asm實現，而不是嘗試從此__umul128中構建256b乘法。但如果你確切需要這個，那麼值得嘗試。使用C++支持恆定傳播和CSE優化，您不會用asm獲得。

鐺編譯此沒有大的問題，實際使用adc所有附加有進位（除了一個，它與setc指令保存）。 MSVC在進位檢查上分支，並且只是產生令人討厭的代碼。海灣合作委員會也沒有做得很好，有一些分支隨身攜帶。（因爲gcc不知道如何把carry = sum < a成adc，gcc bug 79173。）

IDK如果MSVC或GCC支持任何附加與攜帶的用於在32位模式的64位整數內部函數。 _addcarry_u64generates poor code with gcc anyway (in 64-bit mode)，但ICC可能會好。 IDK關於MSVC。

如果你想要一個asm實現，我建議使用這個函數的鏗鏘5.0的輸出。你可能手工找到一些優化，但肯定比MSVC更好。但當然，https://gcc.gnu.org/wiki/DontUseInlineAsm中的大部分參數都適用：如果乘以內聯變爲常數的內容，或者上半部分已知爲零，則阻塞恆定傳播是一個主要問題。

Source + asm output for MSVC 32-bit and clang5.0 32-bit on Godbolt

漂亮的代碼鏗鏘。有點不好的代碼與MSVC，但比以前更好。有點不好與gcc也（沒有改變與其他答案）。

#include <stdint.h> 

#ifdef _MSC_VER 
# include <intrin.h> 
#else 
// MSVC doesn't optimize 32x32 => 64b multiplication without its intrinsic 
// But good compilers can just use this to get a single mul instruction 
static inline 
uint64_t __emulu(uint32_t x, uint32_t y) { 
    return x * (uint64_t)y; 
} 
#endif 

// This is still pretty ugly with MSVC, branching on the carry 
// and using XMM store/integer reload to zero a register! 
// But at least it inlines 4 mul instructions 
// instead of calling a generic 64x64 => 64b multiply helper function 
uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{ 
    // multiplier = ab = a * 2^32 + b 
    // multiplicand = cd = c * 2^32 + d 
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d 
    uint64_t a = multiplier >> 32; 
    uint64_t b = (uint32_t)multiplier; // & 0xFFFFFFFF; 
    uint64_t c = multiplicand >> 32; 
    uint64_t d = (uint32_t)multiplicand; // & 0xFFFFFFFF; 

    //uint64_t ac = __emulu(a, c); 
    uint64_t ad = __emulu(a, d); 
    //uint64_t bc = __emulu(b, c); 
    uint64_t bd = __emulu(b, d); 

    uint64_t adbc = ad + __emulu(b , c); 
    uint64_t adbc_carry = (adbc < ad); // ? 1 : 0; 
    // MSVC gets confused by the ternary and makes worse code than using a boolean in an integer context for 1 : 0 

    // multiplier * multiplicand = product_hi * 2^64 + product_lo 
    uint64_t product_lo = bd + (adbc << 32); 
    uint64_t product_lo_carry = (product_lo < bd); // ? 1 : 0; 
    *product_hi = __emulu(a , c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry; 

    return product_lo; 
}

請確保您只在32位代碼中使用它。在64位代碼中，它無法優化到單個64位mul指令（這會產生全部結果的64位二分之一）。實現GNU C++擴展的編譯器（clang，gcc，ICC）可以使用unsigned __int128並獲得很好的代碼。例如a * (unsigned __int128)b產生128b結果。（Godbolt示例）。

來源

2017-10-25 04:59:58

回答

相關問題