Hacker News new | ask | show | jobs
by kr7 3402 days ago
It does work, though not as well as using an integer multiply.

The approximate latencies for Skylake are:

    div --> 26 cycles
    cvtsi2sd + mulsd + cvttsd2siq --> 6 + 4 + 6 = 16 cycles
I did a quick (and imperfect) microbenchmark, got these results:

    Real integer division (-Os) --> 1.392s
    FPU Multiply (-Os)          --> 0.243s
    FPU Multiply (-O2)          --> 0.197s
    Integer Multiply (-O2)      --> 0.164s
The code:

    #include <stdio.h>

    int main() {
        volatile unsigned x;
        for (unsigned n = 0; n < 100000000; ++n) {
    #if 1 /* Change to 0 to use FPU. */
            /*
            Compile with -Os to get GCC to emit div instruction.
            -O2 to emit integer multiply.
            Clang emits integer multiply, even with -Os.
            */
            x = n / 19;
    #else
            /* Use the FPU. */
            x = (double)n * (1.0 / 19.0);
    #endif
        }
    }
1 comments

Is it correct? What is x for n < 19 for example?
It works for 32-bit unsigned integers and double precision floats.

For n < 19, "(double)n * (1.0 / 19.0)" evaluates to a double between 0.0 and 1.0, then it is truncated to 0 when it is implicitly converted to unsigned int.

Since there are only 2^32 values for 32-bit integers, it is possible to test all values in under a minute:

    #include <stdio.h>
    #include <stdint.h>

    int main() {
        uint32_t n = 0;
        do {
            uint32_t a = n / 19;
            uint32_t b = (double)n * (1.0 / 19.0);
            if (a != b) {
                printf("Not equal for n = %u\n", n);
            }
            ++n;
        } while (n != 0);
    }