It's interesting that ARM32 has conditional execution and I like them a lot for writing readable assembly code. Short jumps that result from a simple if can be encoded in three successive instructions, no branches.
However, it's now falling out of favor (mostly gone from ARM 64) and apparently it's due to the relative cost of putting conditional execution on the die vs. relying on smarter compilers.
I think ARM64 dropped predicates to remove excessive flags-register read ports. Nearly every instruction could read flags register and this limited core frequency (critical path), opportunities to out-of-order execution (and register renaming). Not sure though, maybe someone who knows more about OoO on ARM64 could fill me in?
It's also a huge waste of instruction encoding space! Especially given that 99% of non-branch instructions just put 1110 (AL) or can't be conditional at all (1111).
Incidentally, available register read ports are the reason why cmov takes 2 µop on Intel architectures, but only 1 on AMD, since Intel's µops can only read from two sources but cmov has 3 sources.
Then A32 could have up 4 sources with a conditional op + register shifted register source.
One simple branchless optimization form I've used is collision detection across an array of values: instead of testing each one, i add their value to a counter(perhaps with some mapping of data to collision value). After iterating over a lot of them, I can do just one test. This is very cpu-friendly as the pipeline gets to crunch all the numbers in one go.
Nice article. It inspired me to look around for some more straightforward way of optimizing, and I found the setcc class of instructions: http://www.nynaeve.net/?p=178
I'm thinking that this combined with some CAS (CMPXCHG8B) could acheive the same, right?
I think CAS is a pretty slow operation even without a LOCK prefix. You probably don't want to use it for purposes other than intercore synchronization.
If you have a lot of data to process, using SSE/AVX is a huge win. Conditional masking and min/max instructions for example.
SIMD is a huge win especially in sorting, you can have 10-40x speed-up by using a bitonic sorting network.
Aren't setcc/cmov* instructions effectively similar to a branch? To compute the result you need to execute the previous instruction.
I suppose that these instructions do not cause the instruction pipeline to be flushed, compared to an incorrectly predicted jump, but they still stall until the previous instruction has been executed.
I know that gcc and clang both have __builtin_expect(). If you tell the compiler the more likely path, wouldn't that make the branching version faster?
Actually, I've always wondered how __builtin_expect translates to something the CPU's branch prediction engine can use...
I think in general, most CPU architectures pick branch not taken for forward branches and branch taken for backwards branches on the first try. So I feel like builtin_expect gives weighting to let the compiler shuffle code around to make it fit that pattern.
There are ways of doing it in hardware, I remember a supervisor discussing it with respect to MIPS. I also remember them saying they went through the entire code generation stage of GCC and found that every single point at which GCC would try to use it was somewhere where it would be actively unhelpful.
__builtin_expect is good for error handling code, because gcc can avoid size-increasing optimizations in that path, and it can even move all the unlikely code out into another section so it won't take up space in your caches.
But its code generation is better for a 99%/1% case than a 60%/40% case, because Intel doesn't listen to branch hints anymore nor really give advice on how to tune for them.
On i386, CS and DS prefix bytes have no meaning but are valid for conditional jump instructions and are then used by some CPU's to provide hints to branch predictor.
As an example of another (arguably more sane) architecture, on Alpha, all branch instructions (including non-conditional ones) have bits reserved in their encoding for hints to branch predictor.
Branch prediction in some search in a hash will fail 50% of the time, however it's done. Branch prediction on a long for loop was more than 99% correct by the end of the 90's already.
Intel claims their prediction algorithms are over 96% correct on an average program, whatever that beast is. (To be fair, you'll find a definition for it at their papers. That's a perfectly legit claim, it just does not mean what you think it means.)
If your data is unpredictable, the branchless instructions save time wasted trying to predict it. If you're doing any kind of data compression, either your compressed data is unpredictable or it's not compressed enough!
Getting rid of branching in the unimportant parts of your program is good too; saves space in the cache for important branch predictions.
This is the kind of thing where it's not worth doing by hand, but I don't think compilers really do it either…
However, it's now falling out of favor (mostly gone from ARM 64) and apparently it's due to the relative cost of putting conditional execution on the die vs. relying on smarter compilers.