Hacker News new | ask | show | jobs
by dataflow 1033 days ago
I read it and I still don't get it, can someone (re-)explain what the presence of the print() is doing that is helpful for branch prediction (or any other aspect of the CPU)?

Update: It seems to be the conditional move, see https://news.ycombinator.com/item?id=37245325

4 comments

I read it three or four times. It's never explained. If the print("lol") version has a branch-less-than, what does the regular version have? It must either be a branch or a conditional move, but we aren't shown that part of the assembly. You can't reach any conclusions about why one version is faster if you don't know what you're comparing to.
The high-level view is that adding code that never gets executed causes the compiler to emit code that the CPU predicts better. IDK if this is the compiler assuming that the `print()` call is cold or the branch predictor getting luckier by chance but basically this tickles the CPU in the right way to get better performance.

It seems that this is mostly luck in a strange situation. And of course if you ever hit the `print()` it will be way slower than not. You can probably do better by adding something like a `__builtin_expect(...)` intrinsic in the right place to be more explicit about what the goal is here.

I'm in school, so this may be oversimplified, but if the processor/assembly code is predicting the next result, it gets the result faster. The processor only does this prediction with conditional branches. The extra if for printing or finding the min invoke the prediction with the accuracies stated.
No, this is not true.

There is branch prediction around the length of loops. This is a case where the processor is not able to accurately predict how long it needs to stay in the loop. The BLT instruction changes the prediction model, causing the processor to be more likely to assume the loop will continue.

Honestly, though, worrying about this level of optimization is generally silly. If you're looping through an array often enough that optimizing the code this way is worth your time, you should use a data structure that automatically maintains the max (and min) values for fast retrieval.

> The processor only does this prediction with conditional branches

This sounds... wrong? Unless ARM64 is designed in an absurd way?

I'd love to see the full disassembly; something seems funny here. If it was x86 I would say it's a conditional move causing this, but I don't know what's going on on ARM.

It's interesting that you say conditional move here.

I am confused by this behaviour, and although I definitely don't know what the answer is here; the non-lol version does have a CSEL (https://developer.arm.com/documentation/dui0802/b/CSEL) which is totally missing from the lol version.

Non-lol https://godbolt.org/z/ds1raTYc9

lol https://godbolt.org/z/c3afrb6bG

Ah there you go, that's the conditional move on ARM. Yeah, those are slow.
"Conditional move is slow" is not the right takeaway. (In fact, conditional move can be much /faster/ than a branch in certain circumstances (e.g. hard-to-predict branches, like binary search).)

The reason why this code is slow is because the conditional move is on the critical path for resolving a loop-carried data dependency. In other words, to execute the assignment `maxV = (v if (v > maxV) else maxV)`, the CPU has to test v > maxV. But in order to make this test, the CPU has to wait until the conditional move in the previous loop iteration finishes (to know what the new value of maxV is). So the overall execution time of the loop is worse, because each iteration depends on the previous iteration. (This is analogous to why traversing a linked list is slower than an array.)

However, for the branch version, there is no data dependency on the previous loop, so the CPU really can (speculatively) execute multiple loop iterations in parallel. The the next loop iteration can start before the previous finishes. And with the large reorder buffers on modern CPUs, maybe up to 4-8 loop iterations can execute in parallel!

On the other hand, conditional moves would be much faster for a loop like:

  bigs := make([]int, len(numsA))
  for idx, v : range numsA {
    if numsA[idx] > numsB[idx] {
      bigs[idx] = numsA[idx]
    } else {
      bigs[idx] = numsB[idx]
    }
  }
There is no loop-carried data dependency, so the conditional moves can resolve in parallel. The above code with a conditional move is likely to be much faster than code with an equivalent branch. (And even if branches are perfectly predicted, the conditional move version will only be slightly slower, not 2X slower.)
But why does the print statement matter in how the compiler decides this? In both cases, the execution still depends on the vMax value calculated in the previous iteration.
How exactly does that change anything? It's being run on an increasing array, so the data is assigned every loop. Surely the comparison creates a dependency on the value of the variable which was assigned in the last loop.
So would you say in the general case, i.e. the compiler obviously doesn't know how predictable a branch may be, using conditional move is a poor choice when writing to an address which is repeatedly written to inside the loop?

What I am trying to get a feel for, is whether this represents a bad choice of assembly in general for this kind of loop.

Yeah, after reading the blog post I felt reasonably confident that this is a compiler bug (in terms of perf).

Hopefully someone with a deeper understanding can verify or discredit this here. I'm curious to see the fix for this (I assume it will generate a fix).

I don't know if I would classify it as a bug, but definitely a heuristic that could use tuning.

At least (at a high level) in LLVM, branches and cmovs are represented with the exact same construct, and one of the codegen passes looks at the condition and the two sides and heuristically determines whether to emit a branch or a cmov.

I don't know how Go codegen works, but I assume they do something similar.

Confused, why do you think this is a compiler bug? You should definitely expect a predictable branch to be faster than a conditional move, and the insertion of a conditional move in the original is totally sensible (albeit not intuitive if you haven't seen it).
As to branch prediction, for anyone interested: https://stackoverflow.com/a/11227902