Hacker News new | ask | show | jobs
by nwallin 305 days ago
Accept-reject methods are nonstarters when the architecture makes branching excessively expensive, specifically SIMD and GPU, which is one of the domains where generating random points on a sphere is particularly useful.

The Box-Muller transform is slow because it requires log, sqrt, sin, and cos. Depending on your needs, you can approximate all of these.

log2 can be easily approximated using fast inverse square root tricks:

    constexpr float fast_approx_log2(float x) {
      x = std::bit_cast<int, float>(x);
      constexpr float a = 1.0f / (1 << 23);
      x *= a;
      x -= 127.0f;
      return x;
    }
(conveniently, this also negates the need to ensure your input is not zero)

sqrt is pretty fast; turn `-ffast-math` on. (this is already the default on GPUs) (remember that you're normalizing the resultant vector, so add this to the mag_sqr before square rooting it)

The slow part of sin/cos is precise range reduction. We don't need that. The input to sin/cos Box-Muller is by construction in the range [0,2pi]. Range reduction is a no-op.

For my particular niche, these approximations and the resulting biases are justified. YMMV. When I last looked at it, the fast log2 gave a bunch of linearities where you wanted it to be smooth, however across multiple dimensions these linearities seemed to cancel out.

3 comments

fastmath is absolutely not the default on any GPU compiler I have worked with (including the one I wrote).

If you want fast sqrt (or more generally, if you care at all about not getting garbage), I would recommend using an explicit approx sqrt function in your programming language rather than turning on fastmath.

I've read about the fast inverse square root trick, but it didn't occur to me that it can be used for other formulas or operations. Is this a common trick in DSP/GPU-like architectures nowadays?

And what's the mathematical basis? (that is, is this technique formalized anywhere?)

It seems insane to me that you run Newton's algorithm straight on the IEEE 754 format bits and it works, what with the exponent in excess coding and so on

1/sqrt(x) is complicated. Imagine instead of computing 1/sqrt(x), imagine instead that you wanted to compute exp_2(-.5 log_2(x)). Also imagine you have an ultra fast way to compute exp_2 and log_2. If you have an ultra fast way to compute exp_2 and log_2, then exp_2(-.5 log_2(x)) is gonna be fast to compute.

It turns out you do have an ultra fast way to compute log_2: you bitcast a float to an integer, and then twiddle some bits. The first 8 bits (after the sign bit, which is obviously zero because we're assuming our input is positive) or whatever are the exponent, and the trailing 23 bits are a linear interpolation between 2^n and 2^(n+1) or whatever. exp_2 is the same but in reverse.

You can simply convert the integer to floating point, multiply by -.5, then convert back to integer. But multiplying -.5 by x can be applied to a floating point operating directly on its bits, but it's more complicated. You'll need to do some arithmetic, and some magic numbers.

So you're bitcasting to an integer, twiddling some bits, twiddling some bits, twiddling some bits, twiddling some bits, and bitcasting to a float. It turns out that all the bit twiddling simplifies if you do all the legwork, but that's beyond the scope of this post.

So there you go. You've computed exp_2(-.5 log_2 x). You're done. Now you need to figure out how to apply that knowledge to the inverse square root.

It just so happens that 1/sqrt(x) and exp(-.5 log x) are the same function. exp(-.5 log x) = exp(log(x^-.5)) = x^-.5 = 1/sqrt(x).

Any function where the hard parts are computing log_2 or exp_2 can be accelerated this way. For instance, x^y is just exp_2(y log_2 x).

Note that in fast inverse square root, you're not doing Newton's method on the integer part, you're doing it on the floating point part. Newton's method doesn't need to be done at all, it just makes the final result more accurate.

Here's a blog here that gets into the nitty gritty of how and why it works, and a formula to compute the magic numbers: https://h14s.p5r.org/2012/09/0x5f3759df.html

You can use the same techniques as fast inverse sqrt anywhere logs are useful. It's not particularly common these days because it's slower than a dedicated instruction and there are few situations where the application is both bottlenecked by numerical code and is willing to tolerate the accuracy issues. A pretty good writeup on how fast inverse sqrt works was done here: https://github.com/francisrstokes/githublog/blob/main/2024%2...

A lot of old-school algorithms like CORDIC went the same way.

There's a related technique to compute exponentials with FMA that's somewhat more useful in ML (e.g. softmax), but it has similar precision issues and activation functions are so fast relative to matmul that it's not usually worth it.

The range reduction of sin/cos is slow only because of the stupid leftover from the 19th century that is the measuring of the phases and plane angles in radians.

A much better unit for phase and plane angle is the cycle, where range reduction becomes exact and very fast.

The radian had advantages for symbolic computation done with pen and paper, by omitting a constant multiplicative factor in the derivation and integration formulae.

However, even for numeric computations done with pen and paper, the radian was inconvenient so in the 19th century the sexagesimal degrees and the cycles continued to be used in parallel with the radians.

Since the development of automatic computers there has remained no reason whatsoever to use the radian for anything. Radians are never used in input sensors or output actuators, because that can be done only with low accuracy, but in physical inputs and outputs angles are always measured in fractions of a cycle. Computing derivatives or integrals happens much more seldom than other trigonometric function evaluations. Moreover, the functions that are derived or integrated almost always have another multiplicative factor in their argument, e.g. frequency or wavenumber, so that factor will also appear in the derivation/integration formula and the extra multiplication with 2Pi that appears in the derivative (or with 1/2Pi that appears in the integral) can usually be absorbed in that factor, or in any case it can be done only once for a great number of function evaluations. Therefore switching the phase unit of measurement from radian to cycle greatly reduces the amount of required computation, while also increasing the accuracy of the results.

A mathematical library should implement only the trigonometric functions of 2Pi*x, and their inverses, which suffice for all applications. There is no need for the cos and sin functions with arguments in radians, which are provided by all standard libraries now.

For reasons that are completely mysterious for me, the C standard and the IEEE standard of FP arithmetic have been updated to include the trigonometric functions of Pi*x, not the functions of 2Pi*x.

It is completely beyond my power of comprehension why this has happened. All the existing applications want to measure the phases in cycles, not in half-cycles. At most there are some applications where measuring the phases or plane angles in right angles could be more convenient, i.e. those might like trigonometric functions of (Pi/2)*x, but again not even in those cases there is any use for half-cycles.

So with the functions of the updated math libraries, e.g. cospi and sinpi, you hopefully can avoid the slow range reduction, but you still have to add a superfluous scaling by 2, due to the bad function definitions.

Similar considerations apply to the use of binary logarithms and exponentials instead of the slower and less accurate hyperbolic (a.k.a. natural) logarithms and exponentials.