| HN Mirror

A relevant question from there: Is everything I learned about big-O performance irrelevant now?

The right answer is: No, just pay attention to what you are counting. The expensive ops today are pipeline stalls, caused by cache misses of all kinds. Branch prediction failures are a kind of cache miss. Count pipeline stalls instead of (e.g.) additions, comparisons, or swaps, and then all your big-O intuitions still work.

Modern CPUs maintain counters of various causes of pipeline stalls, so you don't need to guess how many you have had.

. . .

There are techniques to transform computations to branchless forms, of which cmov is a special case. A useful primitive, for example, is

  template <typename T>
  bool swap_if(
      bool c, T& a, T& b) {
    T v[2] = {
      std::move(a),
      std::move(b) };
    b = std::move(v[1-c]);
    a = std::move(v[c]);
    return c;
  }

Then the loop body in a partition becomes

  l += swap_if(
    *r < p, *l, *r);

with no branches. (A compiler could easily translate this to two cmov instructions when T is a machine word, though none do, just yet.) This easily doubles the performance of quicksort on modern CPUs.

Curiously, the time taken when this is compiled with g++ is hugely worse if a is assigned before b, for very non-obvious reasons.