by simonbyrne 2900 days ago
It is worth noting that with AVX-512, Intel has introduced a native inverse sqrt approximation (VRSQRT14).
3 comments

Inverse sqrt approximation is available since SSE1 with rsqrtss & rsqrtps instructions.
Which is nice because SSE1 and SSE2 are mandatory parts of x86_64. If you're writing a 64-bit desktop application, you can use rsqrtss without any checks or fallbacks.

Unfortunately, it doesn't tend to get used automatically in languages like C. The result of rsqrtss is slightly different from 1/sqrtf(x) computed as two separate operations, so the compiler cannot substitute it as an optimization.

If the rules for floating point optimization are loosened by passing -ffast-math to GCC, the compiler will use it. That being said, -ffast-math is a shotgun that affects a lot of things. If you need signed zeros, Infs, NaNs or denormals, that flag may break your program.

> -ffast-math is a shotgun that affects a lot of things

Interesting point. GCC and MSVC both seem to have (incompatible) intrinsic functions, for what that's worth.

https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/X86-Built-in-Fu...

https://docs.microsoft.com/en-us/previous-versions/visualstu...

That is not quite true. You don't need GCC's __builtin functions for this. GCC supports SSE1 intrinsics like _mm_rsqrt_ss exactly the same way MSVC does: they are declared in the xmmintrin.h header. Just include it and _mm_rsqrt_ss/_mm_rsqrt_ps will be available in both GCC and MSVC.
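
For anyone who wants to see it, here is a minimal sketch of calling the intrinsic directly. The wrapper name rsqrt_approx is made up for illustration, and the assert is only a debug-build guard I added; the intrinsics themselves are the standard ones from xmmintrin.h.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE1 intrinsics; the same header works for GCC and MSVC */

/* Hypothetical helper: approximate 1/sqrt(x) with a single RSQRTSS. */
static inline float rsqrt_approx(float x)
{
    assert(x >= 0.0f);                      /* negative inputs would produce NaN */
    __m128 v = _mm_set_ss(x);               /* x in lane 0, upper lanes zeroed */
    return _mm_cvtss_f32(_mm_rsqrt_ss(v));  /* extract lane 0 of the estimate */
}
```

The result is only the raw hardware estimate, so expect a relative error on the order of 1e-4, not a correctly rounded 1/sqrtf(x).
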
I find it quite fortunate that they don't use it automatically. Introducing a 1e-3 relative error is quite a deal breaker for some. Not for games, sure, but for science that is mostly unacceptable.
From memory, GCC does one Newton-Raphson iteration on the approximate result so the error is much lower (closer to e-9 from memory again). They don't use the approximation directly in fast-math mode.
this is wildly off topic, but can anyone from either the scientific or the graphics community comment on the practical impact of losing denorms? i certainly understand that it softens the impact of underflow, but does anyone care?
Indeed.

Both reciprocal (inverse) square root SSE SIMD instructions were available in Intel Pentium III, released in 1999.

Ah, I should have done more googling. I guess the AVX-512 ones are marginally more accurate?
VRSQRT28 too, which has max 2^-28 rel error.

https://software.intel.com/en-us/articles/reference-implemen...

Thanks, I came here to ask a similar question about native optimizations on this 'hack'. Apologies for my lack of knowledge, but I'm a little confused about which of these variants to use when compiling C++ code on a 64-bit platform for the standard 'float' type inverse square root. Are there varying levels of compromise between speed and accuracy among all these methods? Thanks ...
For a generic 64b platform, use RSQRTSS/RSQRTPS, since it's the only one that will exist. The others are specific to rather new hardware.

My recollection is that it's accurate to 11.5 bits, so after one refinement step you have nearly full precision (an error bound of a couple ULP). Check Intel's docs for more details.
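
A rough sketch of that refinement step, assuming the usual Newton-Raphson update for f(y) = 1/y^2 - x. The function name rsqrt_refined is hypothetical, and this is not necessarily the exact sequence any particular compiler emits.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: RSQRTSS estimate followed by one Newton-Raphson step. */
static inline float rsqrt_refined(float x)
{
    assert(x > 0.0f);  /* x == 0 gives an infinite estimate, and 0 * inf is NaN below */
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  /* roughly 12-bit estimate */
    /* Newton-Raphson update: y' = y * (1.5 - 0.5 * x * y * y) */
    return y * (1.5f - 0.5f * x * y * y);
}
```

Each step roughly squares the relative error, which is why a single iteration is enough to get close to float precision from a 12-bit starting point.
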

Thanks!
Note that VRSQRT28 is in AVX-512ER, which is Xeon Phi only.
How does that perform in comparison?
rsqrt{p,s}s has guaranteed relative error <= 1.5 * 2^-12, or about 3.6e-4. According to Agner Fog, it typically executes in one cycle. I would assume that the AVX512 versions are similar.
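
If you want to check that bound on your own machine, something like the following sketch would do it. The helper name and the sampling scheme are my own assumptions; the double-precision reference uses the SSE2 sqrt intrinsic, which is also baseline on x86_64.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2, for the double-precision reference sqrt */
#include <xmmintrin.h>  /* SSE1, for _mm_rsqrt_ss */

/* Hypothetical helper: largest relative error of RSQRTSS against a
   double-precision reference, sampled at `steps` evenly spaced points
   in [lo, hi). */
static double rsqrt_max_rel_error(float lo, float hi, int steps)
{
    assert(steps > 0 && lo > 0.0f && lo < hi);
    double worst = 0.0;
    for (int i = 0; i < steps; ++i) {
        float x = lo + (hi - lo) * ((float)i / (float)steps);
        float approx = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
        /* Reference value: 1 / sqrt(x), computed in double precision. */
        double root = _mm_cvtsd_f64(_mm_sqrt_sd(_mm_setzero_pd(), _mm_set_sd((double)x)));
        double exact = 1.0 / root;
        double err = ((double)approx - exact) / exact;
        if (err < 0.0) err = -err;          /* absolute value of the relative error */
        if (err > worst) worst = err;
    }
    return worst;
}
```

Sampling [1, 4) covers one full period of the mantissa/exponent-parity pattern, so it should be representative, though a sampled maximum can still miss the true worst case.
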