|
I haven't run any benchmarks, but jump-if-equal and set-if-equal would seem to have the same level of predictability. My naive, untested intuition is that there's only one meaningful difference: the former has to dump the entire pipeline on a miss, and the latter only has to nop a single instruction on a miss. But maybe I'm missing something. I'll re-read his rant.
EDIT: Linus rants a lot, but makes one concrete claim:
You can always replace it by
j<negated condition> forward
mov ..., %reg
forward:
and assuming the branch is AT ALL predictable (and 95+% of all branches
are), *the branch-over will actually be a LOT better for a CPU.*
So, I decided to test that.
[18:50:14 user@boxer ~/src/looptest] $ diff -u loop2.s loop4.s
--- loop2.s 2023-07-06 18:40:11.000000000 -0400
+++ loop4.s 2023-07-06 18:46:58.000000000 -0400
@@ -17,11 +17,15 @@
incq %rdi
xorl %edx, %edx
cmpb $115, %cl
- sete %dl
+ jne _run_switches_jmptgt1
+ mov $1, %dl
+_run_switches_jmptgt1:
addl %edx, %eax
xorl %edx, %edx
cmpb $112, %cl
- sete %dl
+ jne _run_switches_jmptgt2
+ mov $1, %dl
+_run_switches_jmptgt2:
subl %edx, %eax
testb %cl, %cl
jne LBB0_1
[18:50:29 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop2.s -o l2
[18:50:57 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop4.s -o l4
[18:51:02 user@boxer ~/src/looptest] $ time ./l2 1000 1
449000
./l2 1000 1 0.69s user 0.00s system 99% cpu 0.697 total
[18:51:09 user@boxer ~/src/looptest] $ time ./l4 1000 1
449000
./l4 1000 1 4.53s user 0.01s system 99% cpu 4.542 total
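For anyone who doesn't read AT&T assembly, here is a rough C-level sketch of what the two loop bodies in the diff compute. It is reconstructed from the diff alone, so treat it as an assumption: the function names are made up, and I'm assuming %rdi walks a NUL-terminated string whose current byte sits in %cl. It is not bench.c, and at -O3 a compiler may well emit the same code for both functions, so it shows the semantics rather than reproducing the timings.

```c
/* Hypothetical C equivalent of the sete loop (loop2.s):
   each comparison yields 0 or 1, which is added to or
   subtracted from the running total. */
int count_setcc_style(const char *s)
{
    int total = 0;
    char c;
    do {
        c = *s++;
        total += (c == 's');   /* cmpb $115, %cl ; sete %dl ; addl %edx, %eax */
        total -= (c == 'p');   /* cmpb $112, %cl ; sete %dl ; subl %edx, %eax */
    } while (c != '\0');       /* testb %cl, %cl ; jne LBB0_1 */
    return total;
}

/* Hypothetical C equivalent of the branch-over loop (loop4.s):
   the flag starts at 0 and the store of 1 is jumped over
   when the comparison fails. */
int count_branch_style(const char *s)
{
    int total = 0;
    char c;
    do {
        int flag = 0;              /* xorl %edx, %edx */
        c = *s++;
        if (c == 's') flag = 1;    /* jne over ; mov $1, %dl */
        total += flag;
        flag = 0;
        if (c == 'p') flag = 1;
        total -= flag;
    } while (c != '\0');
    return total;
}
```

Both functions return the number of 's' bytes minus the number of 'p' bytes, which is why l2 and l4 print the same result.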
I feel pretty confident that Linus has made a poor prediction about poor prediction here. Jumps are indeed slower.
To be fair to Linus, since Clang and I are using sete here, not cmov, I also tested a cmov version (l5), and the difference from the sete version was insignificant:
[19:53:12 user@boxer ~/src/looptest] $ time ./l2 1000 1
449000
./l2 1000 1 0.69s user 0.00s system 99% cpu 0.700 total
[19:53:15 user@boxer ~/src/looptest] $ time ./l5 1000 1
449000
./l5 1000 1 0.68s user 0.00s system 99% cpu 0.683 total
Jumps are slower. |
The difference is that branches have dedicated hardware (branch predictors) that lets the CPU speculatively execute subsequent instructions based on its best guess about which way the branch will go, whereas instructions that depend on the result of a conditional move cannot execute until the correct value is available.
Put another way, CPUs have control flow speculation, but not conditional move speculation. I don't know if conditional move speculation would be a feasible thing to implement or not, but I'm pretty sure that no mainstream CPUs have such a feature.
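A minimal sketch of where that distinction tends to matter, using a lower-bound binary search as the example (my choice of illustration, not something from the benchmark above; both function names are hypothetical). In the select form, the address of the next load depends on the result of the current comparison, so each iteration waits on the previous one. In the branch form, a predictor can guess the direction and let the next load start early. Whether a compiler actually emits cmov for the first and a jump for the second is up to the compiler and flags, so this only shows the two dependency shapes.

```c
#include <stddef.h>

/* Select form: the next value of base is chosen by the comparison,
   so the next load's address is data-dependent on this one.
   Returns the index of the first element >= key, or n if none. */
size_t lower_bound_select(const int *a, size_t n, int key)
{
    const int *base = a;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        base = (base[half - 1] < key) ? base + half : base;
        len -= half;
    }
    /* One element left (or none): step past it if it is still < key. */
    return (size_t)(base - a) + (n > 0 && base[0] < key);
}

/* Branch form: same result, but the choice is expressed as control
   flow, which a branch predictor can guess ahead of the comparison. */
size_t lower_bound_branch(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

The sete benchmark earlier in the thread is the opposite case: the 0/1 result only feeds an add into an accumulator, so nothing expensive waits on it.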