Very early games had lookup tables for trig functions. The cpu instructions were too slow or missing. The tables were either generated at run time or statically defined in the code.
I think that’s one of those things Jai and Zig agree on - compile time functions have a place in preventing magic numbers that cannot be debugged.
Yes, and at least for sqrt(), internally it's likely implemented as a heuristic guess followed by a fixed number of iterations of Newton's Method. (In software, you'd normally iterate Newton's Method until the change in the result is less than some threshold; in hardware, I'm guessing that it might be simpler to figure out the maximum number of iterations that would ever be needed for any input, and always run that many, but I don't know.)
> In software, you'd normally iterate Newton's Method
Software normally computes trigonometric functions (and other complicated ones like exponents and std::erf) with a high-degree polynomial approximation.
> how does that hardware implementation work internally?
I don’t know, but based on performance difference between FP32 and FP64 square root instructions, the implementation probably produces 4-5 bits of mantissa per cycle.
Interesting that your linked algorithm manages to avoid costly divisions, but it uses an even longer loop than Newton's Method -- one iteration for every 2 bits. NM converges quadratically, doubling the number of correct bits each time, so a 32-bit number won't ever need more than 5 iterations, assuming >= 1 bit of accuracy in the initial estimate, which is easy to obtain just by shifting right by half the offset of the highest 1-bit.
There are trade-offs (constant time, perhaps?) and many differing applications ...
For example: Pipelined RISC DSP chips have fat (many parallel streams) "one FFT result per clock cycle" pipelines that are rock solid (no cache hits or jitter).
The setup takes a few cyces but once primed it's
aquired data -> ( pipeline ) -> processed data
every clock cycle (with a pipeline delay, of course).
In that domain hardware implementations are chosen to work well with vector calculations and with consistent capped timings.
( To be clear, I haven't looked into that specific linked algo, I'm just pointing out it's not a N.R. only world )
sqrt is the one exception to this. the newton series is really good and the polynomials aren't great (and the newton based approach prevents you from having to do range reduction)
yeah, that's very likely the explanation. All these functions are pretty high latency instructions, vs rejection sampling which only involves a multiplication. On Nvidia GPUs, mul has latency of 1-4 cycles while others are 16-32.
I think that’s one of those things Jai and Zig agree on - compile time functions have a place in preventing magic numbers that cannot be debugged.