Hacker News new | ask | show | jobs
by torusle 530 days ago
There are couple of tricks you can do if you fiddle with the bits of a floating point value using integer arithmetic and binary logic.

That was a thing back in the 90th..

I wonder how hard the performance hit from moving values between integer and float pipeline is nowadays.

Last time I looked into that was the Cortex-A8 (first I-Phone area). Doing that kind of trick costed around 26 cycles (back and forth) due to pipeline stalls back then.

3 comments

There are basic integer operations in the FP/SIMD units on most CPUs, so there’s no generally need to “move back and forth” unless you need to branch on the result of a comparison, use a value as an address, or do some more specialized arithmetic.
On x86 there is sometimes (depending on the specific microarchitecture) an extra cycle additional latency when using an integer operation on a xmm register last used with a float operation.

I have seen it explained as the integer and foat ALUs's being physically distant and the forwarding network needing an extra cycle to transport the operands.

This is correct, but it’s happily pretty rare for it to matter in practice (because the domain bypass penalty is small and does not directly impact throughput, only latency).
(For that matter, though, most modern FP/SIMD units have a direct approximate-reciprocal instruction that is single-cycle throughput or better and much more accurate--generally around 10-12 bits, so there's no need for this sort of thing. See, e.g. FRECPE on ARM NEON and [V]RCPP[S/D] on x86.)
These kinds of tricks are still used today. They're not so useful if you need a reciprocal or square root, since CPUs now have dedicated hardware for that, but it's different if you need a _cube_ root or x^(1/2.4).
I wonder to what extent the dedicated hardware is essentially implementing the same steps but at the transistor level.
The big cores do. They essentially pump division through something like an FMA (fused multiply-add) unit, possibly the same unit that is used for multiplication and addition. That's for the Newton-Raphson steps, or Goldschmidt steps.

In hardware it's much easier to do a LUT-based approximation for the initial estimate rather than the subtraction trick, though.

It's common for CPUs to give 6-8 accurate bits in the approximation. x86 gives 13 accurate bits. Back in 1975, the Cray 1 gave 30 (!) accurate bits in the first approximation, and it didn't even have a division instruction (everything about that machine was big and fast).

There is also the case with machines that lack FP support, like some ARM Cortex M variants.