Hacker News new | ask | show | jobs
by kr7 3404 days ago
That page has a warning at the top:

> IMPORTANT: Useful feedback revealed that some of these measures are seriously flawed. A major update is on the way.

Looking over the results, some of the numbers are off.

On Intel CPUs, FP multiplication is faster than integer division. Might not be true on ARM CPUs which generally have slower FPUs.

On Skylake, for example, 32-bit unsigned integer division has a 26 cycle latency with a throughput of 1 instruction / 6 cycles, while 32/64-bit floating point multiplication has a 4 cycle latency with a throughput of 2 instructions / cycle.

Source: http://agner.org/optimize/instruction_tables.pdf

1 comments

The crucial point, however, is that while FP multiplication is faster than integer division, converting between a floating point and an integer is very slow: cvtsi2ss and cvtss2si have a latency of 6 cycles each. This adds a latency of 12 cycles for each of these multiplications.

For divisions by a constant value that don't easily decompose into shifts you can fall back to multiplication by a magic constant which is the integer reciprocal. (This is also something compilers do and is what's being explained in the article.)

Multiplying by the integer reciprocal only works if the dividend is an integer multiple of the divisor.

What's being explained in the article is multiplying by a fraction the value of which is close to the rational reciprocal of the divisor, and where the denominator of the fraction is an integer power of two (so dividing by the denominator can be done with a shift).

The fraction in this case is (2938661835 + 2^32) / 2^37.