A relevant question from there: Is everything I learned about big-O performance irrelevant now?
The right answer is: No, just pay attention to what you are counting. The expensive ops today are pipeline stalls, caused by cache misses of all kinds. Branch prediction failures are a kind of cache miss. Count pipeline stalls instead of (e.g.) additions, comparisons, or swaps, and then all your big-O intuitions still work.
Modern CPUs maintain counters of various causes of pipeline stalls, so you don't need to guess how many you have had.
. . .
There are techniques to transform computations to branchless forms, of which cmov is a special case. A useful primitive, for example, is
template <typename T>
bool swap_if(
bool c, T& a, T& b) {
T v[2] = {
std::move(a),
std::move(b) };
b = std::move(v[1-c]);
a = std::move(v[c]);
return c;
}
Then the loop body in a partition becomes
l += swap_if(
*r < p, *l, *r);
with no branches. (A compiler could easily translate this to two cmov instructions when T is a machine word, though none do, just yet.) This easily doubles the performance of quicksort on modern CPUs.
Curiously, the time taken when this is compiled with g++ is hugely worse if a is assigned before b, for very non-obvious reasons.
The right answer is: No, just pay attention to what you are counting. The expensive ops today are pipeline stalls, caused by cache misses of all kinds. Branch prediction failures are a kind of cache miss. Count pipeline stalls instead of (e.g.) additions, comparisons, or swaps, and then all your big-O intuitions still work.
Modern CPUs maintain counters of various causes of pipeline stalls, so you don't need to guess how many you have had.
. . .
There are techniques to transform computations to branchless forms, of which cmov is a special case. A useful primitive, for example, is
Then the loop body in a partition becomes with no branches. (A compiler could easily translate this to two cmov instructions when T is a machine word, though none do, just yet.) This easily doubles the performance of quicksort on modern CPUs.Curiously, the time taken when this is compiled with g++ is hugely worse if a is assigned before b, for very non-obvious reasons.