Hacker News new | ask | show | jobs
by mgaunard 1206 days ago
The compiler is already reducing integer division by a constant into these things.

Those algorithms become more important when you're dividing by a value known at runtime but which remains the same during parts of the program. That's where libdivide comes in.

2 comments

libdivide tl;dr:

> libdivide allows you to replace expensive integer divides with comparatively cheap multiplication and bitshifts. Compilers usually do this, but only when the divisor is known at compile time. libdivide allows you to take advantage of it at runtime. The result is that integer division can become faster - a lot faster. [...] divide SIMD vectors by runtime constants, [1]

> libdivide.h is a header-only C/C++ library for optimizing integer division. Integer division is one of the slowest instructions on most CPUs e.g. on current x64 CPUs a 64-bit integer division has a latency of up to 90 clock cycles whereas a multiplication has a latency of only 3 clock cycles. libdivide allows you to replace expensive integer divsion instructions by a sequence of shift, add and multiply instructions that will calculate the integer division much faster.

> On current CPUs you can get a speedup of up to 10x for 64-bit integer division and a speedup of up to to 5x for 32-bit integer division when using libdivide. libdivide also supports SSE2, AVX2 and AVX512 vector division which provides an even larger speedup. You can test how much speedup you can achieve on your CPU using the benchmark program.[2]

[1] https://libdivide.com/ [2] https://github.com/ridiculousfish/libdivide

The obvious question this invites: if this is generally faster for unknown values, why don't compilers use this optimization directly in emitted code?
They could, but it's less profitable, and cost models are, as always, a big problem. If we have:

  loop for i from 0 to n
      f(a[i]/y)
How big does n have to be before it makes sense to compute a reciprocal for y? And how frequently is it actually that big during the runtime of the code?
> Compilers usually do this, but only when the divisor is known at compile time

if the divisor is in a variable or otherwise 'hidden', the compiler can't deduce enought to get to it, is my guess.

Calculating the divisor values is also expensive, this works when you do this work once and then do the efficient divides multiple times.
See https://godbolt.org/z/3Yqbceaza for what godbolt says clang produces.

Keep in mind that this doesn't use the fact that we know that the input is between 0 to 63.

> Keep in mind that this doesn't use the fact that we know that the input is between 0 to 63.

You can use __builtin_assume for this: https://godbolt.org/z/K4jKhxnTq

An assert() also does the trick: https://godbolt.org/z/MecvMGPdW

edit uh but when asserts are disabled it won't work: https://godbolt.org/z/4TMs1Wc5z

unless you roll your own assert: https://godbolt.org/z/4v35rrTvn

9*x/64 still reduces to 2 instructions

https://godbolt.org/z/6WsWqh4ah