by kr7
3402 days ago
It does work, though not as well as using an integer multiply. The approximate latencies for Skylake are:

    div --> 26 cycles
    cvtsi2sd + mulsd + cvttsd2siq --> 6 + 4 + 6 = 16 cycles

I did a quick (and imperfect) microbenchmark, got these results:

    Real integer division (-Os) --> 1.392s
    FPU Multiply (-Os) --> 0.243s
    FPU Multiply (-O2) --> 0.197s
    Integer Multiply (-O2) --> 0.164s
The code:

    #include <stdio.h>

    int main() {
        volatile unsigned x;
        for (unsigned n = 0; n < 100000000; ++n) {
    #if 1 /* Change to 0 to use FPU. */
            /*
               Compile with -Os to get GCC to emit div instruction.
               -O2 to emit integer multiply.
               Clang emits integer multiply, even with -Os.
            */
            x = n / 19;
    #else
            /* Use the FPU. */
            x = (double)n * (1.0 / 19.0);
    #endif
        }
    }
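
For anyone who wants to check the "it does work" part rather than the timing, here is a rough sketch that compares the FPU path against plain n / 19 over a sample of 32-bit inputs. The helper names div19_fpu and count_mismatches are made up for illustration, and it assumes IEEE-754 double arithmetic (SSE2, as on x86-64). It only covers the divisor 19; other divisors would need their own check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: divide by 19 using a double-precision multiply,
   the same expression as the FPU branch of the benchmark. */
static uint32_t div19_fpu(uint32_t n) {
    return (uint32_t)((double)n * (1.0 / 19.0));
}

/* Hypothetical helper: walk [0, UINT32_MAX] in increments of `step`
   and count inputs where the FPU result differs from n / 19.
   Use step = 1 for an exhaustive (slower) check. */
static uint64_t count_mismatches(uint32_t step) {
    assert(step > 0);
    uint64_t bad = 0;
    /* 64-bit counter so the loop terminates instead of wrapping. */
    for (uint64_t n = 0; n <= UINT32_MAX; n += step) {
        uint32_t v = (uint32_t)n;
        if (div19_fpu(v) != v / 19u) {
            ++bad;
        }
    }
    return bad;
}
```

Calling count_mismatches(1) from a small main() and printing the result covers every unsigned 32-bit input; a larger step gives a quicker spot check.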