by torstenvl 1073 days ago

I haven't run any benchmarks, but jump-if-equal and set-if-equal would seem to have the same level of predictability.

My naive, untested intuition is that there's only one meaningful difference: the former has to dump the entire pipeline on a miss, and the latter only has to nop a single instruction on a miss.

But maybe I'm missing something. I'll re-read his rant.

EDIT:

Linus rants a lot, but makes one concrete claim:

    You can always replace it by
    
      j<negated condition> forward
      mov ..., %reg
     forward:
    
    and assuming the branch is AT ALL predictable (and 95+% of all branches
    are), *the branch-over will actually be a LOT better for a CPU.*

So, I decided to test that.

    [18:50:14 user@boxer ~/src/looptest] $ diff -u loop2.s loop4.s
    --- loop2.s 2023-07-06 18:40:11.000000000 -0400
    +++ loop4.s 2023-07-06 18:46:58.000000000 -0400
    @@ -17,11 +17,15 @@
      incq %rdi
      xorl %edx, %edx
      cmpb $115, %cl
    - sete %dl
    + jne _run_switches_jmptgt1
    + mov $1,   %dl
    +_run_switches_jmptgt1:  
      addl %edx, %eax
      xorl %edx, %edx
      cmpb $112, %cl
    - sete %dl
    + jne _run_switches_jmptgt2
    + mov $1,   %dl
    +_run_switches_jmptgt2:  
      subl %edx, %eax
      testb %cl, %cl
      jne LBB0_1
    [18:50:29 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop2.s -o l2
    [18:50:57 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop4.s -o l4
    [18:51:02 user@boxer ~/src/looptest] $ time ./l2 1000 1
    449000
    ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.697 total
    [18:51:09 user@boxer ~/src/looptest] $ time ./l4 1000 1
    449000
    ./l4 1000 1  4.53s user 0.01s system 99% cpu 4.542 total

I feel pretty confident that Linus has made a poor prediction about poor prediction here. Jumps are indeed slower.

To be fair to Linus, since Clang and I are using sete here, not cmov, I also tested cmov, and the difference was insignificant:

    [19:53:12 user@boxer ~/src/looptest] $ time ./l2 1000 1            
    449000
    ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.700 total
    [19:53:15 user@boxer ~/src/looptest] $ time ./l5 1000 1            
    449000
    ./l5 1000 1  0.68s user 0.00s system 99% cpu 0.683 total

Jumps are slower.
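For readers without the source, here is a minimal C sketch of what the loop in the diff above appears to compute, reconstructed from the assembly only (115 and 112 are the ASCII codes for 's' and 'p'). The function names and the exact loop shape are assumptions, not the actual benchmark code.

```c
/* Hypothetical reconstruction: +1 for every 's', -1 for every 'p',
   scanning until the terminating NUL has been consumed. */

/* Data-dependent form: the comparison result (0 or 1) is used as a value,
   which compilers commonly lower to sete or cmov on x86-64. */
int run_switches_setcc(const char *input) {
    int res = 0;
    char c;
    do {
        c = *input++;
        res += (c == 's');
        res -= (c == 'p');
    } while (c != '\0');
    return res;
}

/* Control-dependent form: written with branches, matching the
   jne-over-mov shape of loop4.s. */
int run_switches_branch(const char *input) {
    int res = 0;
    char c;
    do {
        c = *input++;
        if (c == 's') res += 1;
        if (c == 'p') res -= 1;
    } while (c != '\0');
    return res;
}
```

An optimizing compiler is free to turn either form into the other, which is presumably why the comparison above was made by editing the assembly directly.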

haberman 1073 days ago

> jump-if-equal and set-if-equal would seem to have the same level of predictability.

The difference is that branches have dedicated hardware (branch predictors) that will speculatively execute subsequent instructions based on their best guess about which way the branch will go. Whereas conditional moves cannot execute any subsequent instructions until the correct value is available.

Put another way, CPUs have control flow speculation, but not conditional move speculation. I don't know if conditional move speculation would be a feasible thing to implement or not, but I'm pretty sure that no mainstream CPUs have such a feature.

bjourne 1073 days ago

> Whereas conditional moves cannot execute any subsequent instructions until the correct value is available.

That is incorrect. Superscalar processors have no problem executing subsequent instructions before the cmov writes back. However, the register cmov writes to can of course not be read before cmov has passed the execution unit. But that's no different from other arithmetic instructions.

haberman 1073 days ago

You are correct, I should have clarified, subsequent instructions that depend on the result of the cmov cannot execute until the cmov has executed. Whereas subsequent instructions that depend on the result of the branch instruction can be speculatively executed even before the branch conditional has been evaluated.

bjourne 1072 days ago

True, but regardless of whether "cmov rax, ..." or "jnz L; mov rax, ...; L:" is used, subsequent instructions that read rax need to stall until rax has been written to (or at least until the cmov/mov has executed, if bypasses are used).

NohatCoder 1072 days ago

The difference is that in the case where the condition is false and predicted false the jump variant will not delay if the value being moved into rax is delayed, the cmov variant will. Effectively that value becomes a false dependency.

As best I can tell this case is rare enough that one shouldn't generally be afraid of cmov, and probably compiler authors should consider using it more frequently.

What one shouldn't do is unnecessarily load values that are likely in memory or L3 just to be able to use cmov. That is the case that runs the greatest risk of degrading performance, and it puts extra load on resources that are shared between cores.
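A rough C sketch of the two shapes being contrasted here. The function names are made up for illustration, and whether a compiler actually emits a branch or a cmov for either one is not guaranteed.

```c
/* Branch shape: *p is only read when cond is true, so a slow load is
   skipped entirely on the (predicted) false path. */
long select_branch(int cond, const long *p, long x) {
    if (cond) {
        x = *p;
    }
    return x;
}

/* Select shape: the load is issued unconditionally so that its value is
   available to a conditional move. If *p misses the cache, the result
   waits on that load even when cond is false. Requires p to be valid. */
long select_loaded(int cond, const long *p, long x) {
    long loaded = *p;
    return cond ? loaded : x;
}
```

Both return the same value; the difference is only in which memory accesses happen and what the result has to wait for.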

haberman 1072 days ago

There is also the issue of the branch predicate itself. It is always a true dependency, but when is its value actually needed? For cmov, it is needed before dependent instructions can even be executed. For branch instructions, it is only needed before they can be retired. Speculative execution can keep the pipeline full in the meantime.

bjourne 1072 days ago

Oh, right! I totally forgot about that. I guess it (at least theoretically) could make a big difference in code for the abs function if the noop is the common case and also easily predictable.

Lk7Of3vfJS2n 1073 days ago

I'd be curious to learn why CPUs don't have conditional move speculation.

Tuna-Fish 1073 days ago

Because modern CPUs as a rule don't speculate on values to arithmetic, only on control flow, and CMOV acts like arithmetic.

That is, if there is an add instruction on rax and rbx, no matter what, the add instruction will not execute until both rax and rbx are available. If the result went into rax, and there is another instruction that uses that as a source, no matter what, that instruction will not execute until the add has completed.

CMOV is implemented as an ALU instruction that always writes into its output, and either writes the value that is already in there (which is why it depends on the value of its output) or the value provided, depending on flags.
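As a software analogy only (this is not how the hardware is wired, and `cmov_model` is a made-up name), the "always writes its output" behaviour looks like a three-input select:

```c
#include <stdint.h>

/* Models cmov as a pure function of three inputs: the condition, the
   source, and the OLD destination value. The old destination is a real
   input, which is why cmov has to wait for it. */
uint64_t cmov_model(int cond, uint64_t src, uint64_t old_dst) {
    /* mask is all ones when cond is true, all zeros otherwise */
    uint64_t mask = (uint64_t)0 - (uint64_t)(cond != 0);
    return (src & mask) | (old_dst & ~mask);
}
```

The result is always produced, and it cannot be produced until all three inputs are known.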

torstenvl 1073 days ago

I'm not saying you're wrong — I'm completely ignorant at the microcode level — but it seems to me like between

    cmp x, y
    je z

and

    cmp x, y
    sete z

the actual speculative part is the same: speculating as to the result of cmp x, y

If that's true, why would it not simply pipeline sete and the following instructions and simply execute (or not execute) sete according to its prediction, and then double check itself and reverse (or apply) the operation if the prediction was wrong?

I probably just have a bad mental model of what's going on under the (under the) hood, so whatever patience you have to deal with my stupid questions would be greatly appreciated.

Tuna-Fish 1072 days ago

The two sequences look very similar, and could be implemented the same way, but the actual implementation could not be more different.

> If that's true, why would it not simply pipeline sete and the following instructions and simply execute (or not execute) sete according to its prediction, and then double check itself and reverse (or apply) the operation if the prediction was wrong?

You cannot just reverse or apply one operation. The way speculation works, when the frontend encounters a conditional jump, the entire architectural state of the current thread is stored, and all future memory writes are held in the store buffer and not written out. Then a long time, potentially dozens of cycles later, after the je is executed in the backend either the old state is restored and the pending writes are discarded, or the saved state is discarded and the pending writes are released.

In contrast, in ALUs, the inputs for instructions are always available before the instructions are scheduled to execute. It would be possible to implement sete like je, but this would imply significant changes to how and where it is executed. ALU ops cannot trigger speculation because there is no machinery for storing state at that part of the pipeline.

And no-one is ever going to implement cmov or sete like a jump, because moving the op from being an ALU op to being one that is speculatively executed in the frontend like jmp would make both positive and negative changes, and that would be a significant pessimization of existing software because for decades cmovs have been used for unpredictable values, where sequencing and waiting for the real value is a better idea than speculating and failing half the time. Using a cmov serializes execution when any following operations use the value, but if you can have independent work after it, you can always successfully execute that. Speculating at an unpredictable CMOV would cause that to be thrown away uselessly half the time.

celrod 1073 days ago

Taking the example:

      cmpb $115, %cl
      sete %dl
      addl %edx, %eax

vs

      cmpb $115, %cl
      jne _run_switches_jmptgt1
      mov $1,   %dl
     _run_switches_jmptgt1:  
      addl %edx, %eax

The argument about why `jne` might be faster is that in the former case, the CPU always executes a dependency chain of length 3: `cmpb` -> `sete` -> `addl`. Each of these instructions has to be computed one after the other, as `sete` depends on the result of `cmpb`, and `addl` depends on the result of `sete`.

With `jne`, the CPU might predict the branch is not taken, in which case, the dependency chain is `mov` -> `addl` (the `mov` of an immediate might be handled by register renaming?).

Or that it is taken, in which case the dependency chain is just `addl`.

I guess you're arguing that the CPU should handle `sete` the same way? That is, instead of treating `addl` as dependent on the result, predict what `sete` does and start executing `addl` before `sete` finishes, rewinding if that went wrong?

torstenvl 1073 days ago

Yeah, or at least I don't understand why that wouldn't be possible.

Microcode can set the EIP register based on its prediction of what the result of cmpb $115, %cl will be.

Why can't it set the EDX register based on its prediction of what the result of cmpb $115, %cl will be?

ithkuil 1072 days ago

The purpose of control flow speculation is to avoid stalling the pipeline.

If each instruction was executed in one single clock cycle, the cost of executing a branch would be one cycle and that's it.

However since there is a maximum speed at which operations can happen in hardware, the period of such a clock cycle that can execute a whole instruction would be very long and so the amount of "instructions per second" the CPU could execute would be low.

Now, if you can break up each instruction in smaller steps and execute the smaller steps in an overlapping manner, such that while you're executing the second step of the first instruction you're executing the first step of the next instruction and so on (like on an assembly line in a factory) you can have a much shorter clock period for each of these steps, and at the end of each clock tick an instruction would complete execution. The CPU will be still running one instruction per clock cycle, but since each clock period is shorter the overall instruction per second rate will be higher.

But for this to work the next instruction you want to execute must be known in advance so that at each clock cycle the CPU can start step 1 of a new instruction.

That's easy when the program is executing sequentially but when there are branches involved it's more tricky.

And that's tricky also if the branch is not conditional! If the instruction execution is broken into many small steps, it may take one or more steps before figuring out that you have a branch in the first place, let alone decoding where you need to branch to. In the meantime the CPU will have happily started to execute the first "steps" of the next instruction.

This is called a "branch hazard"

Early CPU implementations handled branch hazards by just throwing away the intermediate states of the few instructions that were halfway through the pipeline and calling it a day (stalling the pipeline).

Early RISC CPUs attempted to be clever and use a trick called "delay slots": the instruction(s) already in the pipeline will continue to execute as if they were logically before the branch. This puts the onus on the programmer (or the compiler) to make sure that only instructions that are safe to be executed regardless of whether the branch is taken or not are actually put after the branch instruction (otherwise you can just write nops).

But branch delay slots are not a panacea. As pipelines got deeper it became impractical to have a large number of delay slots, and even a small number of delay slots were often just filled with nops anyway.

Improving on UNconditional branches was done by "looking ahead" in the instruction stream for branch instructions. When the instructions are all of the same size it's easy to quickly look a few instructions ahead and tell when you found a branch. You also need an instruction encoding scheme that is relatively fast to decode, at the very least it should be fast to decode branches (the more complicated the logic to decode a branch is, the farther ahead you'd have to look in the instruction stream, which in turn would limit the size of the sequence of instructions you can fill your pipeline with between subsequent branches).

To further complicate the matter, even if you found the branch instruction and you decoded it, it doesn't mean you yet know where it will branch to!

Indirect jumps (where the address is in a register) are similar to conditional jumps in that you don't know the address you're jumping to by merely looking ahead in the instruction stream and noticing the branch instruction. You need to either wait until you execute the branch and stall the pipeline in the meantime, or keep them in the pipeline and flush the pipeline once you know the target of the branch.

The next trick that CPU designers came up with, way before speculative execution, is "branch target prediction".

The CPU keeps a little associative memory that maps addresses of branch instructions to branch targets. When the lookahead logic spots a branch instruction, it looks in this map, gets a guess of the branch target, and uses that immediately as the next instruction so that the pipeline is kept fed with something.

If by the time the branch instruction is executed the guess turned out to be wrong, the pipeline is flushed in the same way it would have to be flushed anyway if we had no clever branch lookahead in the first place. But if the guess was right we paid only one cycle to execute the branch.

This works for indirect unconditional branches and also for conditional branches! The prediction logic can be more subtle and complicated, and many, many things have been attempted, but this is the general idea.
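A toy software model of the branch-target map described above, in case it helps. It is direct-mapped with a tag check rather than truly associative, and the table size, index function and names are all arbitrary assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 16 /* hypothetical size; real tables are far larger */

typedef struct {
    bool     valid;
    uint64_t branch_pc; /* tag: address of the branch instruction */
    uint64_t target;    /* last observed target of that branch */
} btb_entry;

typedef struct {
    btb_entry e[BTB_ENTRIES];
} btb;

static unsigned btb_index(uint64_t pc) {
    return (unsigned)((pc >> 2) % BTB_ENTRIES);
}

/* Returns true and writes *target if the table has a guess for this pc. */
bool btb_lookup(const btb *t, uint64_t pc, uint64_t *target) {
    const btb_entry *entry = &t->e[btb_index(pc)];
    if (entry->valid && entry->branch_pc == pc) {
        *target = entry->target;
        return true;
    }
    return false;
}

/* Called once the branch has really executed and its target is known. */
void btb_update(btb *t, uint64_t pc, uint64_t target) {
    btb_entry *entry = &t->e[btb_index(pc)];
    entry->valid = true;
    entry->branch_pc = pc;
    entry->target = target;
}
```

On a hit the front end keeps fetching from the guessed target; on a miss or a wrong guess it falls back to the flush described above, then calls the update so the next encounter does better.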

sitkack 1073 days ago

I hope you work on compiler backends.

epcoa 1073 days ago

With all due respect this is quite literally the level of stuff covered in an undergrad EE architecture course and is covered in an elementary text like Patterson and Hennessy.

sitkack 1071 days ago

> With all due respect

> quite literally

You could have conveyed close to the same thing by saying, "things like this are covered in Patterson and Hennessy"

> elementary text

Jesus, do you even lift? The rest of the discussion is amazing.

nwallin 1073 days ago

Speculative execution is all about control flow. It's about what value is in the instruction pointer at some nebulous point in the future.

A conditional jump can put one of two values into the instruction pointer: it will either increment the instruction pointer (jump not taken) or put the immediate value into the instruction pointer (jump taken).

cmov/sete are utterly deterministic; they always increment the instruction pointer. There's nothing to speculate on, there's nothing to predict. They just go to the next instruction.

sweetjuly 1072 days ago

> Speculative execution is all about control flow

It's murkier than that. Speculation also deals with the order in which instructions can be executed. Take for example memory ordering (discussed in a mini essay elsewhere here): we typically speculate that all loads are unrelated to any other older in-flight stores with unresolved addresses so that we can optimistically launch them. This is not a control flow issue but it is something we both speculate and predict (memory dependence predictors!) despite the next PC being essentially deterministic.

kazinator 1073 days ago

> Speculative execution is all about control flow. It's about what value is in the instruction pointer at some nebulous point in the future.

.. and all about what we can wheedle out of all the background speculation that will help us get root on this box.

sweetjuly 1072 days ago

One other perspective is that by speculating the outcomes of conditional instructions, you naturally open yourself up to misspeculating them. This sounds obvious but the consequences for the uarch are quite severe. This is because anytime you misspeculate an instruction, most (all?) contemporary CPUs throw out all younger speculative progress (even if it is unrelated!) and restart at the instruction they originally misspeculated. Throwing out all this work is both i) a waste of power/cycles (you did all this speculative work for nothing!) and ii) quite an expensive operation because you either have to iteratively roll back the state (slow!) or take a snapshot of the state on every conditional instruction (expensive from a power/area perspective).

A similar idea to what you're proposing (and a possible solution to the above issue) does come up in another part of the processor however! Specifically, high performance processors launch loads very aggressively and often return data as soon as the address is known. This is because memory is often the bottleneck for performance. This, unfortunately, has some challenges. Namely, memory ordering violations. Take for example the following snippet (ARMv8):

    mov x1, #1    
    udiv x3, x2, x1
    str x2, [x3]
    ldr x4, [x2]
    add x5, x4, x4

This is a silly and somewhat contrived code sequence, but note here that both str x2 and ldr x4 access the same address (x3 is x2 divided by 1) and thus the value in x4 should be x2. Note, however, that since str x2's address (x3) is produced by a slow division operation but ldr x4's address (x2) is available much more quickly, ldr x4 likely will launch before the CPU even knows that str x2 conflicts with it. Thus, the data returned by the load will be whatever random old stale data is in the cache rather than the correct value that is currently sitting in x2. This means that the subsequent add which consumes this data will produce an incorrect value, leading the whole program to derail. Once the CPU detects this issue, it has to throw away all the state and restart execution of the program at ldr x4 in order to fix its mistake and fix up the memory ordering violation. In essence, the CPU is speculating that str x2 and ldr x4 are unrelated because doing so is very important for performance. Unfortunately, however, memory ordering violations are actually somewhat common and constantly having to restart execution has negative performance implications.

Now, this is actually a very similar problem as we'd see with conditional instruction speculation! So how do we solve this issue for memory ordering violations? Well, we predict which pairs of stores and loads are dependent and block the load from launching until the address of its supposed dependent store resolves. If this predictor is functioning well, we are able to both aggressively launch loads while also avoiding many costly fixups!

So, how would we translate this to conditional instruction speculation? Well, one idea is that we could predict both whether a given instruction is predictable and, if so, which way we should predict it. If a conditional instruction is predicted as unpredictable, its result will not be speculated (thereby avoiding frequent costly restarts) but if it is predicted to be predictable, we can try to predict which one to take.
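To make that concrete, here is a purely hypothetical sketch using a classic 2-bit saturating counter per conditional instruction. No shipping CPU is claimed to work this way, and every name below is invented for illustration.

```c
/* One counter per conditional instruction, range 0..3.
   0 = strongly false, 3 = strongly true, 1 and 2 = not confident. */
typedef struct {
    unsigned char counter;
} cond_pred;

typedef enum {
    DONT_SPECULATE, /* wait for the real flags, like cmov does today */
    SPECULATE_FALSE,
    SPECULATE_TRUE
} cond_guess;

/* Only speculate when the counter is saturated in one direction. */
cond_guess cond_predict(const cond_pred *p) {
    if (p->counter == 3) return SPECULATE_TRUE;
    if (p->counter == 0) return SPECULATE_FALSE;
    return DONT_SPECULATE;
}

/* Train with the real outcome once the condition has been evaluated. */
void cond_train(cond_pred *p, int outcome) {
    if (outcome) {
        if (p->counter < 3) p->counter++;
    } else {
        if (p->counter > 0) p->counter--;
    }
}
```

Starting from a middle state, a condition that alternates true/false bounces between 1 and 2 and is never speculated, while one that is consistently true saturates at 3 after a couple of executions.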

Would this work? Maybe. Will anyone actually do this? Likely not. As others have suggested, conditional instructions are almost exclusively used for hard to predict conditions specifically because CPUs don't speculate them. Thus, in most existing code the predictor would just say "yep can't predict it" and we'd just have ended up wasting a bunch of area and power on a predictor that never gets used.

If you're really dedicated to this cause though, feel free to write a paper on it. Spitballing performance numbers is easy but often wrong in quite surprising ways, so maybe this might just work for some weird reason I've missed :)

seventhson 1073 days ago

Linus' post is 15+ years old. Much has changed in Intel hardware since then. He was probably right on the money re the hardware available at the time.

torstenvl 1073 days ago

> I don't know when the change was made, but conditional moves are fast and efficient on the last several generations of AMD and Intel processors. Usually, you are trading 1 or 2 extra cycles of latency against the chance of a ~15 cycle mispredicted branch penalty. If your branch cannot be predicted correctly ~85% of the time, this can be a significant win.

https://news.ycombinator.com/item?id=10749195
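Back-of-the-envelope version of that trade-off, using only the numbers in the quote. The model is deliberately crude (an assumption on my part): it charges the branch nothing when predicted correctly and the full penalty otherwise, and it ignores any work the branch lets the CPU overlap.

```c
/* Expected cycles lost per execution of a branch under the crude model. */
double branch_cost(double hit_rate, double mispredict_penalty) {
    return (1.0 - hit_rate) * mispredict_penalty;
}

/* Prediction accuracy at which the branch and the cmov cost the same:
   (1 - p) * penalty == extra_latency  =>  p == 1 - extra_latency / penalty */
double break_even_hit_rate(double cmov_extra_latency, double mispredict_penalty) {
    return 1.0 - cmov_extra_latency / mispredict_penalty;
}
```

With 2 extra cycles against a 15-cycle penalty the break-even comes out near 87%, and with 1 extra cycle near 93%, which is in the same ballpark as the ~85% figure quoted.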

zaxomi 1072 days ago

I read the rant. He is talking about Pentium 4.

AnotherGoodName 1073 days ago

The inputs here are random which is the problem and why this isn't demonstrating that. Create an input of all 's' and compare it.

torstenvl 1073 days ago

Better than random input, but still only ~half as fast as using sete

    [19:13:34 user@boxer ~/src/looptest] $ diff -u bench.c bench-alls.c      
    --- bench.c 2023-07-06 16:04:16.000000000 -0400
    +++ bench-alls.c 2023-07-06 19:13:34.000000000 -0400
    @@ -17,7 +17,7 @@
       int num_rand_calls = number / CHAR_BIT + 1;
       unsigned char *buffer = malloc(num_rand_calls * CHAR_BIT);
       for (int i = 0; i < num_rand_calls; i++) {
    -    buffer[i] = rand();
    +    buffer[i] = 's'; //rand();
       }
       return buffer;
     }
    [19:13:37 user@boxer ~/src/looptest] $ gcc -O3 bench-alls.c loop2.s -o l2
    [19:13:42 user@boxer ~/src/looptest] $ gcc -O3 bench-alls.c loop4.s -o l4
    [19:13:47 user@boxer ~/src/looptest] $ time ./l2 1000 1
    250001000
    ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.699 total
    [19:13:55 user@boxer ~/src/looptest] $ time ./l4 1000 1
    250001000
    ./l4 1000 1  1.28s user 0.00s system 99% cpu 1.290 total

Jumps are slower.

Guvante 1073 days ago

Microbenchmarks are hard. You aren't doing any meaningful work that could benefit from speculatively executing instead of stalling for the conditional value.

Similarly you might be busting the pipeline by chaining together the jumps so close together.

Not saying your point is wrong, just saying your proof isn't super solid.

gpderetta 1072 days ago

In this benchmark the only loop-carried dependency is over the res variable (edit: and of course the index). The jump doesn't break these dependencies, so for this specific problem the additional latency of the cmov doesn't matter: it is always perfectly pipelined, and cmov will always come out on top. But if the input of cmov depended on a previous value, then potentially a branch could be better given a high enough prediction rate.
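One well-known shape where the select does sit on the loop-carried path is a binary search written with a conditional select. This is a sketch with a made-up function name, and whether the compiler emits cmov for the ternary is not guaranteed.

```c
#include <stddef.h>

/* Returns the index of the first element >= key in sorted a[0..n). */
size_t lower_bound_select(const int *a, size_t n, int key) {
    const int *base = a;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        /* The address of the NEXT iteration's load depends on this
           select, so each iteration waits for the previous one. */
        base = (base[half - 1] < key) ? base + half : base;
        len -= half;
    }
    return (size_t)(base - a) + (size_t)(len == 1 && base[0] < key);
}
```

Here every load address depends on the previous select, so the latencies add up serially. A branchy version could speculate ahead, which only pays off if the comparisons are predictable.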

MobiusHorizons 1073 days ago

Jumps are slower on completely random input. If I understand Linus’s point correctly, he is suggesting that random inputs like this are unusual (although a good way to measure worst case performance)

zaxomi 1072 days ago

Did you test this on a Pentium 4, the processor that Linus is talking about?