Hacker News new | ask | show | jobs
by thoughtFrame 305 days ago
I've read about the fast inverse square root trick, but it didn't occur to me that it can be used for other formulas or operations. Is this a common trick in DSP/GPU-like architectures nowadays?

And what's the mathematical basis? (that is, is this technique formalized anywhere?)

It seems insane to me that you run Newton's algorithm straight on the IEEE 754 format bits and it works, what with the exponent in excess coding and so on

2 comments

1/sqrt(x) is complicated. Imagine instead of computing 1/sqrt(x), imagine instead that you wanted to compute exp_2(-.5 log_2(x)). Also imagine you have an ultra fast way to compute exp_2 and log_2. If you have an ultra fast way to compute exp_2 and log_2, then exp_2(-.5 log_2(x)) is gonna be fast to compute.

It turns out you do have an ultra fast way to compute log_2: you bitcast a float to an integer, and then twiddle some bits. The first 8 bits (after the sign bit, which is obviously zero because we're assuming our input is positive) or whatever are the exponent, and the trailing 23 bits are a linear interpolation between 2^n and 2^(n+1) or whatever. exp_2 is the same but in reverse.

You can simply convert the integer to floating point, multiply by -.5, then convert back to integer. But multiplying -.5 by x can be applied to a floating point operating directly on its bits, but it's more complicated. You'll need to do some arithmetic, and some magic numbers.

So you're bitcasting to an integer, twiddling some bits, twiddling some bits, twiddling some bits, twiddling some bits, and bitcasting to a float. It turns out that all the bit twiddling simplifies if you do all the legwork, but that's beyond the scope of this post.

So there you go. You've computed exp_2(-.5 log_2 x). You're done. Now you need to figure out how to apply that knowledge to the inverse square root.

It just so happens that 1/sqrt(x) and exp(-.5 log x) are the same function. exp(-.5 log x) = exp(log(x^-.5)) = x^-.5 = 1/sqrt(x).

Any function where the hard parts are computing log_2 or exp_2 can be accelerated this way. For instance, x^y is just exp_2(y log_2 x).

Note that in fast inverse square root, you're not doing Newton's method on the integer part, you're doing it on the floating point part. Newton's method doesn't need to be done at all, it just makes the final result more accurate.

Here's a blog here that gets into the nitty gritty of how and why it works, and a formula to compute the magic numbers: https://h14s.p5r.org/2012/09/0x5f3759df.html

You can use the same techniques as fast inverse sqrt anywhere logs are useful. It's not particularly common these days because it's slower than a dedicated instruction and there are few situations where the application is both bottlenecked by numerical code and is willing to tolerate the accuracy issues. A pretty good writeup on how fast inverse sqrt works was done here: https://github.com/francisrstokes/githublog/blob/main/2024%2...

A lot of old-school algorithms like CORDIC went the same way.

There's a related technique to compute exponentials with FMA that's somewhat more useful in ML (e.g. softmax), but it has similar precision issues and activation functions are so fast relative to matmul that it's not usually worth it.