| HN Mirror



	by simonbyrne 2900 days ago
	It is worth noting that with AVX-512, Intel has introduced a native inverse sqrt approximation (VRSQRT14).

3 comments

mmozeiko 2900 days ago

An inverse sqrt approximation has been available since SSE1 via the rsqrtss and rsqrtps instructions.


slavik81 2900 days ago

Which is nice because SSE1 and SSE2 are mandatory parts of x86_64. If you're a 64-bit desktop application, you can use rsqrtss without any checks or fallbacks.

Unfortunately, it doesn't tend to get used automatically in languages like C. The result of rsqrtss is slightly different from 1/sqrtf(x) computed as two separate operations, so the compiler cannot apply it as an optimization.

If the rules for floating point optimization are loosened by passing -ffast-math to GCC, the compiler will use it. That being said, -ffast-math is a shotgun that affects a lot of things. If you need signed zeros, Infs, NaNs or denormals, that flag may break your program.


MaxBarraclough 2900 days ago

> -ffast-math is a shotgun that affects a lot of things

Interesting point. GCC and MSVC both seem to have (incompatible) intrinsic functions, for what that's worth.

https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/X86-Built-in-Fu...

https://docs.microsoft.com/en-us/previous-versions/visualstu...


mmozeiko 2899 days ago

That is not quite true. You don't need to use GCC __builtin functions for this. GCC supports SSE1 intrinsics like _mm_rsqrt_ss exactly the same way MSVC does: they are declared in the xmmintrin.h header. Just include it and _mm_rsqrt_ss/_mm_rsqrt_ps will be available in both GCC and MSVC.
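
As a rough illustration of the portable intrinsic route, here is a minimal sketch that assumes an x86 target with SSE1. The wrapper name `rsqrt_approx` is made up for this example; only `_mm_set_ss`, `_mm_rsqrt_ss` and `_mm_cvtss_f32` come from xmmintrin.h.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics, same header for GCC and MSVC */

/* Hypothetical helper: approximate 1/sqrt(x) for a single float
   using the RSQRTSS instruction. No refinement is applied, so the
   result only carries roughly 11 to 12 bits of precision. */
static float rsqrt_approx(float x)
{
    __m128 v = _mm_set_ss(x);    /* x in the lowest lane, other lanes zero */
    __m128 r = _mm_rsqrt_ss(v);  /* hardware approximation in the lowest lane */
    return _mm_cvtss_f32(r);     /* extract the lowest lane as a float */
}
```

Calling `rsqrt_approx(4.0f)` should give something close to, but not necessarily exactly, 0.5.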


edynoid 2899 days ago

I find it quite fortunate that they don't use it automatically. Introducing a 1e-3 relative error is quite a deal breaker for some. Not for games, sure, but for science that is mostly unacceptable.


meta_AU 2899 days ago

From memory, GCC does one Newton-Raphson iteration on the approximate result, so the error is much lower (closer to 1e-9, again from memory). They don't use the approximation directly in fast-math mode.


convolvatron 2899 days ago

this is wildly off topic, but can anyone from either the scientific or the graphics community comment on the practical impact of losing denorms? i certainly understand that it softens the impact of underflow, but does anyone care?


vardump 2900 days ago

Indeed.

Both reciprocal (inverse) square root SSE SIMD instructions were available in the Intel Pentium III, released in 1999.


simonbyrne 2898 days ago

Ah, I should have done more googling. I guess the AVX-512 ones are marginally more accurate?


zbjornson 2900 days ago

VRSQRT28 too, which has max 2^-28 rel error.

https://software.intel.com/en-us/articles/reference-implemen...


piyush_soni 2900 days ago

Thanks, I came here to ask a similar question about native optimizations of this 'hack'. Apologies for my lack of knowledge, but I'm a little confused about which of these variants to use when compiling C++ code on a 64-bit platform for a standard 'float' inverse square root. Are there varying levels of compromise between speed and accuracy among all these methods? Thanks ...


stephencanon 2899 days ago

For a generic 64b platform, use RSQRTSS/RSQRTPS, since it's the only one that will exist. The others are specific to rather new hardware.

My recollection is that it's accurate to 11.5 bits, so after one refinement step you have nearly full precision (an error bound of a couple ULP). Check Intel's docs for more details.
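
The refinement step mentioned here can be sketched as follows. This is the textbook Newton-Raphson update for f(y) = 1/y^2 - x, namely y1 = y0 * (1.5 - 0.5 * x * y0^2), applied to the RSQRTSS estimate; it is not necessarily the exact sequence any particular compiler emits. The function name `rsqrt_refined` is made up for this example, and an x86 target with SSE1 is assumed.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: RSQRTSS estimate followed by one
   Newton-Raphson step. Each step roughly doubles the number of
   correct bits, so about 11.5 bits become close to full float
   precision. */
static float rsqrt_refined(float x)
{
    /* Initial hardware estimate y0 of 1/sqrt(x). */
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));

    /* One Newton-Raphson step: y1 = y0 * (1.5 - 0.5 * x * y0^2). */
    return y * (1.5f - 0.5f * x * y * y);
}
```

Note that this sketch does not handle x = 0 or x = Inf specially; the refinement step turns those estimates into NaN, so real code may need a guard for such inputs.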


piyush_soni 2899 days ago

Thanks!


stephencanon 2899 days ago

Note that VRSQRT28 is in AVX-512ER, which is Xeon Phi only.


robin_reala 2900 days ago

How does that perform in comparison?


brandmeyer 2900 days ago

rsqrt{p,s}s has guaranteed relative error <= 1.5 * 2^-12, or about 3.6e-4. According to Agner Fog, it typically executes in one cycle. I would assume that the AVX512 versions are similar.
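
For anyone who wants to check that bound on their own machine, here is a minimal sketch that samples a range and records the worst relative error of RSQRTSS against a reference built from SQRTSS and a division. The function name `max_rsqrt_rel_error` and the sampling scheme are made up for this example; it assumes an x86 target with SSE1 and n >= 2. The actual error observed varies between CPU models.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: largest observed relative error of RSQRTSS
   over n evenly spaced samples in [lo, hi]. Requires n >= 2 and
   lo > 0. The reference value is itself a float, so it carries a
   rounding error of about 1e-7, far below the bound being checked. */
static float max_rsqrt_rel_error(float lo, float hi, int n)
{
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float x = lo + (hi - lo) * (float)i / (float)(n - 1);
        float approx = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
        float exact = 1.0f / _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
        float rel = (approx - exact) / exact;
        if (rel < 0.0f) rel = -rel;      /* absolute value without libm */
        if (rel > worst) worst = rel;
    }
    return worst;
}
```

Comparing the result of `max_rsqrt_rel_error(1.0f, 4.0f, 10000)` against `1.5f / 4096.0f` gives a quick sanity check of the documented limit, though sampling cannot prove the bound for every input.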
