Hacker News new | ask | show | jobs
by MattyDub 1011 days ago
Can somebody explain why a square root is also considered a flop? Surely that involves more work than the other four operations the article listed. Is there some hardware algorithm for the square root that is as fast as (e.g.) division?
3 comments

Square root is pretty much equivalent to division in complexity, and computed by similar techniques (digit-by-digit methods or newton-raphson or goldschmidt iterations). Division is often a little more efficient, but square root has fewer messy edge cases (it never overflows nor underflows).

Division and square root are generally slower than the other arithmetic operations, in both latency and throughput. They are finally partially pipelined in recent CPUs (a result every two or three cycles), but were totally unpipelined in mainstream designs for many years before that. A decade ago, they might take a few tens of cycles, now they’re generally somewhere around ten cycles latency on “real” CPUs, vs 3-5 cycles latency for the other floating point arithmetic instructions.

Relevant Quake engine trivia, for those that are interested:

https://en.m.wikipedia.org/wiki/Fast_inverse_square_root

A question that I am not quite smart enough to figure out. The fast inverse square root works via a giant hack. hand waving a bit here but it has to do with because floats are stored as an exponential structure bit shifting one does magic math things to it. however note that bit shifting a float is not an implemented instruction. so you have to hack it by casting to an int.

Anyway the question is. If you implemented fast inverse square root with out the hack. (i don't know, perhaps packing your own bits, or it might be screwball enough you only assembly could do it.) would it be as fast?

It's not much of a "hack", it's literally how the number format works. If you implemented it without the "hack", you'd just be at the mercy of the compiler to see through your arithmetic on the float bit patterns and reconstitute it.

It's not actually that fast, either. Every mainstream architecture from the last 15 years (except maybe RISC-V, since it's exactly the sort of thing they would forget to add) has a "approximate square root" instruction with single-cycle throughput, and they're all both more accurate and more efficient than this. It was good for its time, but it's time has passed.

Guess: It's one of the operations specified in IEEE 754, and traditionally the implementation is supplied by your hardware vendor rather than implemented by you (even if that implementation is "really" a software algorithm).
On x86-64, there are instructions, see https://stackoverflow.com/a/54642811/468334