| HN Mirror


	by nxobject 546 days ago
	I don't know how numerics in hardware work, but would the use of functions like sin, cos, sqrt incur a penalty as well, even if only a slight one? It's really fascinating to think about how all of this would work.

3 comments

hinkley 546 days ago

Very early games had lookup tables for trig functions. The CPU instructions were too slow or missing. The tables were either generated at run time or statically defined in the code.

I think that’s one of those things Jai and Zig agree on - compile time functions have a place in preventing magic numbers that cannot be debugged.
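
A minimal sketch of the lookup-table approach, assuming a 256-entry table and angles measured in 1/256ths of a turn. The names `TABLE_SIZE`, `SIN_TABLE`, `fast_sin` and `fast_cos` are made up for illustration. The table is built once at import time, standing in for a table generated at compile time or at program start.

```python
import math

TABLE_SIZE = 256  # assumed size; a power of two so the index can wrap with a mask

# Built once, up front. In a compiled language this could be a
# compile-time constant instead of a block of hand-typed magic numbers.
SIN_TABLE = [math.sin(2.0 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]


def fast_sin(angle_units: int) -> float:
    """Sine of an angle given in 1/TABLE_SIZE turns (0..255 is a full circle)."""
    return SIN_TABLE[angle_units & (TABLE_SIZE - 1)]


def fast_cos(angle_units: int) -> float:
    """Cosine from the same table, shifted by a quarter turn."""
    return fast_sin(angle_units + TABLE_SIZE // 4)
```

The trade-off is resolution: with 256 entries the angle is quantized to about 1.4 degrees, which was often acceptable for sprite rotation but not for general numerics.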

akoboldfrying 546 days ago

Yes, and at least for sqrt(), internally it's likely implemented as a heuristic guess followed by a fixed number of iterations of Newton's Method. (In software, you'd normally iterate Newton's Method until the change in the result is less than some threshold; in hardware, I'm guessing that it might be simpler to figure out the maximum number of iterations that would ever be needed for any input, and always run that many, but I don't know.)
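
To make the two stopping rules concrete, here is a rough software sketch of both: iterate until the change is below a threshold, or always run a fixed number of steps. Everything here is an assumption for illustration (the exponent-halving guess, the tolerance, and the count of 6 iterations). It is not a description of how any real FPU works.

```python
import math


def initial_guess(x: float) -> float:
    """Heuristic guess: 2 raised to half of x's binary exponent."""
    _mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent
    return math.ldexp(1.0, exponent // 2)


def sqrt_threshold(x: float, tol: float = 1e-15) -> float:
    """Software style: iterate until the relative change drops below tol."""
    if x < 0.0:
        raise ValueError("negative input")
    if x == 0.0:
        return 0.0
    guess = initial_guess(x)
    while True:
        nxt = 0.5 * (guess + x / guess)  # one Newton step for f(g) = g*g - x
        if abs(nxt - guess) <= tol * nxt:
            return nxt
        guess = nxt


def sqrt_fixed(x: float, iterations: int = 6) -> float:
    """Fixed-count style: always run the same number of Newton steps."""
    if x < 0.0:
        raise ValueError("negative input")
    if x == 0.0:
        return 0.0
    guess = initial_guess(x)
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess
```

With this particular guess the starting relative error is under about 42%, and each step roughly squares the error, so a small fixed count is enough for double precision. The count of 6 is an assumed safe value for this sketch, not a proven minimum.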

Const-me 546 days ago

> at least for sqrt(), internally it's likely implemented as a heuristic guess

Square roots are implemented in hardware: https://www.felixcloutier.com/x86/sqrtsd

> In software, you'd normally iterate Newton's Method

Software normally computes trigonometric functions (and other complicated ones like exponents and std::erf) with a high-degree polynomial approximation.
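
A rough sketch of the polynomial approach for sine: reduce the argument to a small interval, then evaluate an odd polynomial with Horner's rule. This version uses plain Taylor coefficients, which is an assumption made to keep it short. Production math libraries typically use tuned (minimax-style) coefficients and a much more careful range reduction. The name `poly_sin` is hypothetical.

```python
import math

# Taylor coefficients of sin(x)/x in powers of x^2: 1, -1/3!, 1/5!, ...
# (assumption for this sketch; real libraries use tuned coefficients).
_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(8)]  # up to x^15


def poly_sin(x: float) -> float:
    # Range reduction: write x = k*pi + r with r in about [-pi/2, pi/2].
    k = round(x / math.pi)
    r = x - k * math.pi
    # Horner evaluation of r * P(r^2).
    r2 = r * r
    acc = 0.0
    for c in reversed(_COEFFS):
        acc = acc * r2 + c
    result = r * acc
    # sin(r + k*pi) = (-1)^k * sin(r)
    return -result if k % 2 else result
```

The naive reduction `x - k * math.pi` loses accuracy for very large `x`, which is one reason real implementations are more involved than this.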

akoboldfrying 546 days ago

>Square roots are implemented in hardware

But how does that hardware implementation work internally?

The point I'm trying to make is that it is probably an (in-hardware) loop that uses Newton's Method.

ETA: The point being that, although in the source code it looks like all loops have been eliminated, they really haven't been if you dig deeper.

Const-me 545 days ago

> how does that hardware implementation work internally?

I don’t know, but based on performance difference between FP32 and FP64 square root instructions, the implementation probably produces 4-5 bits of mantissa per cycle.

defrost 546 days ago

There are other methods used in hardware, for example:

https://en.wikipedia.org/wiki/Methods_of_computing_square_ro...

Something like Heron's method is a special case of Newton's method.
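
For reference, the relation can be checked in a couple of lines. Applying Newton's method to f(x) = x^2 - S, where S is the number whose root is wanted (the symbol S is introduced here for illustration), gives Heron's averaging step:

```latex
\begin{align*}
f(x) &= x^2 - S, \qquad f'(x) = 2x \\
x_{n+1} &= x_n - \frac{f(x_n)}{f'(x_n)}
         = x_n - \frac{x_n^2 - S}{2x_n}
         = \frac{1}{2}\left(x_n + \frac{S}{x_n}\right)
\end{align*}
```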

akoboldfrying 546 days ago

Interesting that your linked algorithm manages to avoid costly divisions, but it uses an even longer loop than Newton's Method -- one iteration for every 2 bits. NM converges quadratically, doubling the number of correct bits each time, so a 32-bit number won't ever need more than 5 iterations, assuming >= 1 bit of accuracy in the initial estimate, which is easy to obtain just by shifting right by half the offset of the highest 1-bit.
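
A side-by-side sketch of the two styles for 32-bit integers. The first is a binary digit-by-digit square root, which I am assuming is the kind of division-free algorithm being discussed (the link above is truncated). The second is integer Newton with a shift-based initial estimate. Both function names are hypothetical.

```python
def isqrt_bitwise(n: int) -> int:
    """Digit-by-digit root: shifts, adds and compares only, 2 input bits per step."""
    assert 0 <= n < 1 << 32
    result = 0
    bit = 1 << 30  # highest power of four that fits in 32 bits
    while bit > n:
        bit >>= 2
    while bit:
        if n >= result + bit:
            n -= result + bit
            result = (result >> 1) + bit
        else:
            result >>= 1
        bit >>= 2
    return result


def isqrt_newton(n: int) -> int:
    """Integer Newton: fewer steps, but each one needs a division."""
    assert n >= 0
    if n == 0:
        return 0
    # Estimate from the position of the highest set bit, rounded up so the
    # starting value is never below the true root (a variation on the
    # "shift by half the offset" idea, chosen so the loop decreases monotonically).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        y = (x + n // x) >> 1
        if y >= x:
            return x
        x = y
```

The bitwise version runs a predictable number of cheap steps, while the Newton version runs fewer steps that each contain a division, which is the trade-off being weighed here.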

defrost 546 days ago

There are trade-offs (constant time, perhaps?) and many differing applications ...

For example: Pipelined RISC DSP chips have fat (many parallel streams) "one FFT result per clock cycle" pipelines that are rock solid (no cache hits or jitter).

The setup takes a few cycles but once primed it's

acquired data -> ( pipeline ) -> processed data

every clock cycle (with a pipeline delay, of course).

In that domain hardware implementations are chosen to work well with vector calculations and with consistent capped timings.

( To be clear, I haven't looked into that specific linked algo, I'm just pointing out it's not a N.R. only world )

adgjlsfhk1 546 days ago

sqrt is the one exception to this. The Newton iteration is really good and the polynomials aren't great (and the Newton-based approach saves you from having to do range reduction).

shihab 546 days ago

yeah, that's very likely the explanation. All these functions are pretty high latency instructions, vs rejection sampling which only involves a multiplication. On Nvidia GPUs, mul has latency of 1-4 cycles while others are 16-32.
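
To illustrate the comparison, here is a sketch that assumes the task is picking a uniform random point in the unit disk (the parent article isn't shown here, so that is a guess). The rejection version needs only multiplies, adds and a compare per attempt. The direct version needs a sqrt, a sin and a cos per point. Function names are hypothetical.

```python
import math
import random


def disk_point_rejection(rng: random.Random) -> tuple[float, float]:
    """Draw from the enclosing square and retry until the point is in the disk."""
    while True:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # multiplies and adds only
            return x, y


def disk_point_polar(rng: random.Random) -> tuple[float, float]:
    """Direct method: one sqrt, one sin and one cos per point, no retries."""
    r = math.sqrt(rng.random())  # sqrt keeps the density uniform over area
    theta = 2.0 * math.pi * rng.random()
    return r * math.cos(theta), r * math.sin(theta)
```

The rejection loop accepts with probability pi/4 (about 78.5%), so on average it costs roughly 1.27 attempts per point. On a GPU the retry loop can also cause thread divergence, so which version wins there is something to measure rather than assume.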
