Hacker News new | ask | show | jobs
by trolan 1023 days ago
I'm in school, so this may be oversimplified, but if the processor/assembly code is predicting the next result, it gets the result faster. The processor only does this prediction with conditional branches. The extra if for printing or finding the min invoke the prediction with the accuracies stated.
2 comments

No, this is not true.

There is branch prediction around the length of loops. This is a case where the processor is not able to accurately predict how long it needs to stay in the loop. The BLT instruction changes the prediction model, causing the processor to be more likely to assume the loop will continue.

Honestly, though, worrying about this level of optimization is generally silly. If you're looping through an array often enough that optimizing the code this way is worth your time, you should use a data structure that automatically maintains the max (and min) values for fast retrieval.

> The processor only does this prediction with conditional branches

This sounds... wrong? Unless ARM64 is designed in an absurd way?

I'd love to see the full disassembly; something seems funny here. If it was x86 I would say it's a conditional move causing this, but I don't know what's going on on ARM.

It's interesting that you say conditional move here.

I am confused by this behaviour, and although I definitely don't know what the answer is here; the non-lol version does have a CSEL (https://developer.arm.com/documentation/dui0802/b/CSEL) which is totally missing from the lol version.

Non-lol https://godbolt.org/z/ds1raTYc9

lol https://godbolt.org/z/c3afrb6bG

Ah there you go, that's the conditional move on ARM. Yeah, those are slow.
"Conditional move is slow" is not the right takeaway. (In fact, conditional move can be much /faster/ than a branch in certain circumstances (e.g. hard-to-predict branches, like binary search).)

The reason why this code is slow is because the conditional move is on the critical path for resolving a loop-carried data dependency. In other words, to execute the assignment `maxV = (v if (v > maxV) else maxV)`, the CPU has to test v > maxV. But in order to make this test, the CPU has to wait until the conditional move in the previous loop iteration finishes (to know what the new value of maxV is). So the overall execution time of the loop is worse, because each iteration depends on the previous iteration. (This is analogous to why traversing a linked list is slower than an array.)

However, for the branch version, there is no data dependency on the previous loop, so the CPU really can (speculatively) execute multiple loop iterations in parallel. The the next loop iteration can start before the previous finishes. And with the large reorder buffers on modern CPUs, maybe up to 4-8 loop iterations can execute in parallel!

On the other hand, conditional moves would be much faster for a loop like:

  bigs := make([]int, len(numsA))
  for idx, v : range numsA {
    if numsA[idx] > numsB[idx] {
      bigs[idx] = numsA[idx]
    } else {
      bigs[idx] = numsB[idx]
    }
  }
There is no loop-carried data dependency, so the conditional moves can resolve in parallel. The above code with a conditional move is likely to be much faster than code with an equivalent branch. (And even if branches are perfectly predicted, the conditional move version will only be slightly slower, not 2X slower.)
But why does the print statement matter in how the compiler decides this? In both cases, the execution still depends on the vMax value calculated in the previous iteration.
The version with the print statement requires the branch because it's conditionally executed; the 'continue' statement in the if-block will skip it if the loop finds a new maximum.

Since the function-with-if version requires a branch, there's no advantage (regardless of predictability) in using a conditional move for the assignment to maximum.

How exactly does that change anything? It's being run on an increasing array, so the data is assigned every loop. Surely the comparison creates a dependency on the value of the variable which was assigned in the last loop.
For branch instructions but not for conditional move instructions, the processor will happily execute the next instruction before the values are actually ready, then invalidate the work it has done later if it turns out retrospectively to have been wrong. This is called branch prediction.
For a strictly increasing array, assuming the branch predictor always predicts correctly (guesses taken), the CPU does not have to wait for the test to resolve (v > maxV) before it executes the assignment (maxV = v). So effectively the CPU is just running

  maxV = v
  maxV = v
  maxV = v
  maxV = v
over and over (with tests running in parallel as well, but those matter less, because they are mostly just "checking" whether the branch predictor guessed correctly.).

Then, in that above sequence, there is no /read/ of maxV after each write, so there is no true data dependency. (We don't need to wait for the last write to finish before we write again, unlike the read after write case.)

So would you say in the general case, i.e. the compiler obviously doesn't know how predictable a branch may be, using conditional move is a poor choice when writing to an address which is repeatedly written to inside the loop?

What I am trying to get a feel for, is whether this represents a bad choice of assembly in general for this kind of loop.

Nit: If a variable/object is only being written, that’s fine. There is a data dependency here because there is a read after a write.

I don’t have enough compiler knowledge to be able to definitively answer your question, but I think it would be a reasonable heuristic to favor emitting a branch over a cmov in cases where the cmov participates in a loop-carried data dependency.

The other advantage of emitting a branch is that branch prediction can better adapt to runtime conditions. Branch prediction tends to struggle when data is perfectly random, but most data is not. So in general it’s also reasonable for compilers to default to emitting branches.

Yeah, after reading the blog post I felt reasonably confident that this is a compiler bug (in terms of perf).

Hopefully someone with a deeper understanding can verify or discredit this here. I'm curious to see the fix for this (I assume it will generate a fix).

I don't know if I would classify it as a bug, but definitely a heuristic that could use tuning.

At least (at a high level) in LLVM, branches and cmovs are represented with the exact same construct, and one of the codegen passes looks at the condition and the two sides and heuristically determines whether to emit a branch or a cmov.

I don't know how Go codegen works, but I assume they do something similar.

Confused, why do you think this is a compiler bug? You should definitely expect a predictable branch to be faster than a conditional move, and the insertion of a conditional move in the original is totally sensible (albeit not intuitive if you haven't seen it).
> and the insertion of a conditional move in the original is totally sensible (albeit not intuitive if you haven't seen it).

Would you mind expanding? If the conditional move isn't needed, and given that it's costly in terms of perfs, how is that “totally sensible” to have one here?