| Accept-reject methods are nonstarters when the architecture makes branching excessively expensive, specifically SIMD and GPU, which is one of the domains where generating random points on a sphere is particularly useful. The Box-Muller transform is slow because it requires log, sqrt, sin, and cos. Depending on your needs, you can approximate all of these. log2 can be easily approximated using fast inverse square root tricks: constexpr float fast_approx_log2(float x) {
x = std::bit_cast<int, float>(x);
constexpr float a = 1.0f / (1 << 23);
x *= a;
x -= 127.0f;
return x;
}
(conveniently, this also negates the need to ensure your input is not zero)sqrt is pretty fast; turn `-ffast-math` on. (this is already the default on GPUs) (remember that you're normalizing the resultant vector, so add this to the mag_sqr before square rooting it) The slow part of sin/cos is precise range reduction. We don't need that. The input to sin/cos Box-Muller is by construction in the range [0,2pi]. Range reduction is a no-op. For my particular niche, these approximations and the resulting biases are justified. YMMV. When I last looked at it, the fast log2 gave a bunch of linearities where you wanted it to be smooth, however across multiple dimensions these linearities seemed to cancel out. |
If you want fast sqrt (or more generally, if you care at all about not getting garbage), I would recommend using an explicit approx sqrt function in your programming language rather than turning on fastmath.