2011-06-08 58 views
4

我正在尋找固定點16.16數字的最佳反平方根算法。下面的代碼是我到目前爲止(但基本上它取平方根和除以原始數字,我想得到反平方根沒有劃分)。如果它改變了任何東西,代碼將被編譯爲armv5te。固定點逆sqrt

uint32_t INVSQRT(uint32_t n) 
{ 
    uint64_t op, res, one; 
    op = ((uint64_t)n<<16); 
    res = 0; 
    one = (uint64_t)1 << 46; 
    while (one > op) one >>= 2; 
    while (one != 0) 
    { 
     if (op >= res + one) 
     { 
      op -= (res + one); 
      res += (one<<1); 
     } 
     res >>= 1; 
     one >>= 2; 
    } 
    res<<=16; 
    res /= n; 
    return(res); 
} 
+4

Pedantry:推測你的意思是*倒數*平方根? – 2011-06-08 23:25:30

+2

http://en.wikipedia.org/wiki/Fast_inverse_square_root? – Guerrero 2011-06-08 23:29:37

+0

^我正準備回覆那個 – Jonathan 2011-06-08 23:30:23

回答

3

訣竅是採用牛頓法的問題X - 1/Y^2 = 0。因此,給定的x,使用迭代解決方案用於Y。

Y_(n+1) = y_n * (3 - x*y_n^2)/2 

除以2只是一個位移,或者最壞的情況是乘以0.5。這個方案恰好按要求收斂到y = 1/sqrt(x),並且根本沒有任何真正的分歧。

唯一的問題是,你需要一個體面的y值。我記得,迭代收斂的估計值y有限制。

+1

您可以使用指數搜索找到一個體面的起點,找到x * y * y - 1更改符號的區間,然後在該區間中使用割線方法,然後使用該方法中的牛頓。 – 2011-06-09 00:58:53

+0

但是如何看代碼?你認爲它會比我已有的效率更高嗎? – Jonathan 2011-06-09 09:49:42

+1

此代碼將每次迭代結果中的正確數字數量加倍。因此,如果您使用的是16位數字,則需要大約4次迭代才能收斂,如果從第一個近似值的1個正確數字開始。它如何看待代碼?它看起來非常像我上面寫的單行。 – 2011-06-09 11:10:29

3

ARMv5TE處理器提供了一個快速整數乘法器和一個「計數前導零」指令。他們通常也會有中等大小的緩存。基於此,對於高性能實現而言,最合適的方法似乎是初始近似的表格查找,然後是兩個Newton-Raphson迭代,以實現完全準確的結果。我們可以進一步加速這些迭代的第一步,並將其納入表中,這是Cray計算機四十年前使用的一種技術。

下面的函數fxrsqrt()實現了這種方法。它以一個8位近似值r開始,參數爲a的倒數平方根,但不是存儲r,而是每個表元素存儲3r(在32位條目的低10位中)和(在32位條目的高位22位)。這允許快速計算第一次迭代爲 r = 1.5 * r - a * r 。第二次迭代,然後以常規的方式計算爲r = 0.5 * R *(3 - R 1 *(R * A))。

爲了能夠準確地執行這些計算,無論輸入的大小如何,參數a在計算開始時被歸一化,本質上表示爲2.32定點數乘以比例因子爲2 scal。在計算結束時,根據公式1/sqrt對結果進行非規範化(2 2n)= 2 -n。通過捨棄最重要的丟棄位爲1的結果,提高了準確性,導致幾乎所有結果都被正確舍入。

該代碼使用兩個輔助函數:__clz()確定非零32位參數中前導零位的數量。 __umulhi()計算兩個無符號32位整數的完整64位乘積的32個最高有效位。這兩個函數都應該通過編譯器內部函數或通過使用一些內聯彙編來實現。在下面的代碼中,我展示了適用於ARM CPU的便攜式實現以及用於x86平臺的內聯彙編版本。在ARMv5TE平臺__clz()應該映射映射到CLZ指令,而__umulhi()應該映射到UMULL

#include <stdio.h> 
#include <stdlib.h> 
#include <stdint.h> 
#include <math.h> 

#define USE_OWN_INTRINSICS 1 

#if USE_OWN_INTRINSICS 
__forceinline int __clz (uint32_t a) 
{ 
    int r; 
    __asm__ ("bsrl %1,%0\n\t" : "=r"(r): "r"(a)); 
    return 31 - r; 
} 

uint32_t __umulhi (uint32_t a, uint32_t b) 
{ 
    uint32_t r; 
    __asm__ ("movl %1,%%eax\n\tmull %2\n\tmovl %%edx,%0\n\t" 
      : "=r"(r) : "r"(a), "r"(b) : "eax", "edx"); 
    return r; 
} 
#else // USE_OWN_INTRINSICS 
int __clz (uint32_t a) 
{ 
    uint32_t r = 32; 
    if (a >= 0x00010000) { a >>= 16; r -= 16; } 
    if (a >= 0x00000100) { a >>= 8; r -= 8; } 
    if (a >= 0x00000010) { a >>= 4; r -= 4; } 
    if (a >= 0x00000004) { a >>= 2; r -= 2; } 
    r -= a - (a & (a >> 1)); 
    return r; 
} 

uint32_t __umulhi (uint32_t a, uint32_t b) 
{ 
    return (uint32_t)(((uint64_t)a * b) >> 32); 
} 
#endif // USE_OWN_INTRINSICS 

/* 
* For each sub-interval in [1, 4), use an 8-bit approximation r to reciprocal 
* square root. To speed up subsequent Newton-Raphson iterations, each entry in 
* the table combines two pieces of information: The least-significant 10 bits 
* store 3*r, the most-significant 22 bits store r**3, rounded from 24 down to 
* 22 bits such that accuracy is optimized. 
*/ 
uint32_t rsqrt_tab [96] = 
{ 
    0xfa0bdefa, 0xee6af6ee, 0xe5effae5, 0xdaf27ad9, 
    0xd2eff6d0, 0xc890aec4, 0xc10366bb, 0xb9a71ab2, 
    0xb4da2eac, 0xadce7ea3, 0xa6f2b29a, 0xa279a694, 
    0x9beb568b, 0x97a5c685, 0x9163067c, 0x8d4fd276, 
    0x89501e70, 0x8563da6a, 0x818ac664, 0x7dc4fe5e, 
    0x7a122258, 0x7671be52, 0x72e44a4c, 0x6f68fa46, 
    0x6db22a43, 0x6a52623d, 0x67041a37, 0x65639634, 
    0x622ffe2e, 0x609cba2b, 0x5d837e25, 0x5bfcfe22, 
    0x58fd461c, 0x57838619, 0x560e1216, 0x53300a10, 
    0x51c72e0d, 0x50621a0a, 0x4da48204, 0x4c4c2e01, 
    0x4af789fe, 0x49a689fb, 0x485a11f8, 0x4710f9f5, 
    0x45cc2df2, 0x448b4def, 0x421505e9, 0x40df5de6, 
    0x3fadc5e3, 0x3e7fe1e0, 0x3d55c9dd, 0x3d55d9dd, 
    0x3c2f41da, 0x39edd9d4, 0x39edc1d4, 0x38d281d1, 
    0x37bae1ce, 0x36a6c1cb, 0x3595d5c8, 0x3488f1c5, 
    0x3488fdc5, 0x337fbdc2, 0x3279ddbf, 0x317749bc, 
    0x307831b9, 0x307879b9, 0x2f7d01b6, 0x2e84ddb3, 
    0x2d9005b0, 0x2d9015b0, 0x2c9ec1ad, 0x2bb0a1aa, 
    0x2bb0f5aa, 0x2ac615a7, 0x29ded1a4, 0x29dec9a4, 
    0x28fabda1, 0x2819e99e, 0x2819ed9e, 0x273c3d9b, 
    0x273c359b, 0x2661dd98, 0x258ad195, 0x258af195, 
    0x24b71192, 0x24b6b192, 0x23e6058f, 0x2318118c, 
    0x2318718c, 0x224da189, 0x224dd989, 0x21860d86, 
    0x21862586, 0x20c19183, 0x20c1b183, 0x20001580 
}; 

/* This function computes the reciprocal square root of its 16.16 fixed-point 
* argument. After normalization of the argument if uses the most significant 
* bits of the argument for a table lookup to obtain an initial approximation 
* accurate to 8 bits. This is followed by two Newton-Raphson iterations with 
* quadratic convergence. Finally, the result is denormalized and some simple 
* rounding is applied to maximize accuracy. 
* 
* To speed up the first NR iteration, for the initial 8-bit approximation r0 
* the lookup table supplies 3*r0 along with r0**3. A first iteration computes 
* a refined estimate r1 = 1.5 * r0 - x * r0**3. The second iteration computes 
* the final result as r2 = 0.5 * r1 * (3 - r1 * (r1 * x)). 
* 
* The accuracy for all arguments in [0x00000001, 0xffffffff] is as follows: 
* 639 results are too small by one ulp, 1457 results are too big by one ulp. 
* A total of 2096 results deviate from the correctly rounded result. 
*/ 
uint32_t fxrsqrt (uint32_t a) 
{ 
    uint32_t s, r, t, scal; 

    /* handle special case of zero input */ 
    if (a == 0) return ~a; 
    /* normalize argument */ 
    scal = __clz (a) & 0xfffffffe; 
    a = a << scal; 
    /* initial approximation */ 
    t = rsqrt_tab [(a >> 25) - 32]; 
    /* first NR iteration */ 
    r = (t << 22) - __umulhi (t, a); 
    /* second NR iteration */ 
    s = __umulhi (r, a); 
    s = 0x30000000 - __umulhi (r, s); 
    r = __umulhi (r, s); 
    /* denormalize and round result */ 
    r = ((r >> (18 - (scal >> 1))) + 1) >> 1; 
    return r; 
} 

/* reference implementation, 16.16 reciprocal square root of non-zero argment */ 
uint32_t ref_fxrsqrt (uint32_t a) 
{ 
    double arg = a/65536.0; 
    double rsq = sqrt (1.0/arg); 
    uint32_t r = (uint32_t)(rsq * 65536.0 + 0.5); 
    return r; 
} 

int main (void) 
{ 
    uint32_t arg = 0x00000001; 
    uint32_t res, ref; 
    uint32_t err, lo = 0, hi = 0; 

    do { 
     res = fxrsqrt (arg); 
     ref = ref_fxrsqrt (arg); 

     err = 0; 
     if (res < ref) { 
      err = ref - res; 
      lo++; 
     } 
     if (res > ref) { 
      err = res - ref; 
      hi++; 
     } 
     if (err > 1) { 
      printf ("!!!! arg=%08x res=%08x ref=%08x\n", arg, res, ref); 
      return EXIT_FAILURE; 
     } 
     arg++; 
    } while (arg); 
    printf ("results too low: %u too high: %u not correctly rounded: %u\n", 
      lo, hi, lo + hi); 
    return EXIT_SUCCESS; 
}