Hacker News new | ask | show | jobs
by torstenvl 1073 days ago
I'm not so sure that the right take-away is "hand-written assembler is 6x faster than C." It's more like "jumps are a lot slower than conditional arithmetic." And that can [edit:often] be achieved easily in C by simply not using switch statements when an if statement or two will work fine.

Rewriting the C function as follows got a 5.5x speedup:

    int run_switches(char *input) {
        int r = 0;
        char c; 
        while (1) {
            c = *input++;
            if (c == 's') r++;
            if (c == 'p') r--;
            if (c == '\0') break;
        }
        return r;
    }
Results:

    [16:50:14 user@boxer ~/looptest] $ gcc -O3 bench.c loop1.c -o lone
    [16:50:37 user@boxer ~/looptest] $ gcc -O3 bench.c loop2.c -o ltwo
    [16:50:47 user@boxer ~/looptest] $ time ./lone 1000 1
    449000
    ./lone 1000 1  3.58s user 0.00s system 99% cpu 3.589 total
    [16:50:57 user@boxer ~/looptest] $ time ./ltwo 1000 1
    449000
    ./ltwo 1000 1  0.65s user 0.00s system 99% cpu 0.658 total
8 comments

Nice! There's a part two in which I rewrote the C. I got a 12x speedup :)

https://owen.cafe/posts/the-same-speed-as-c/

And as others have pointed out, you can tweak the input, then vectorize the algo, if you want to go that route.

I considered this a pedagogical exercise and I sincerely hope nobody will start dropping down to assembly without a very good reason to.

Wondering how res += (c=='s')-(c=='p') might do. I sure there is some C undefined behaviour relevant there. Curious but too lazy to check it myself!
While `false` evaluates to 0, not sure `true` always evaluate to 1 in C... maybe compiler dependent. Maybe add `? 1 : 0`
C doesn't even originally have true/false, I think you may be conflating the two concepts that "any nonzero int is truthy" and "boolean expressions evaluate to ints". The standard mandates that boolean expressions like equality always evaluate to 0/1.
The `true` constant is always 1. C11 §7.18 (3):

> true which expands to the integer constant 1,

And equality yields a 1 or 0. C11 §6.5.9 (3):

> The == (equal to) and != (not equal to) operators are analogous to the relational operators except for their lower precedence. Each of the operators yields 1 if the specified relation is true and 0 if it is false.

ive seen people doing += !!(c=='s')-!!(c=='p') for that
I'm sure people do that (even though it's not necessary per some year C standard) but generally the pattern is actually for converting things which are not already 0 or 1 into 0 or 1. For example, you might want to use it here:

    int num_empty_strings = !!(strlen(s1)) + !!(strlen(s2)) + !!(strlen(s3))
which is equivalent to:

    int num_empty_strings = (strlen(s1) != 0) + (strlen(s2) != 0) + (strlen(s3) != 0)
Which you use is really a matter of coding style.
If we are being cryptic already, why not

    int num_empty_strings = !!*s1 + !!*s2 + !!*s3;
That isn't only more cryptic, it's also potentially a lot more efficient -- strlen takes time proportional to the length of the string, which of course you don't need to do if you only care whether or not the length is zero. You shouldn't use strlen for empty-string tests.
Note that your code computes the number of nonempty strings.
That is entirely unneccessary. An == expression will always evaluate to 1 or 0. The !!x trick can be useful in some other situations, though.

Here’s a thing you could do (but I don’t know why):

+= !(c-’s’) - !(c-’p’)

Great post!

One thought: If the code is rewritten using bit arithmetic, then potentially the result could be even faster as there need not be a pointer look-up.

A bit arithmetic solution would have a mask created for the characters ‘p’ and ‘s’, and then the result could be AND-ed, and then with more bit arithmetic this all 1s value can be translated to a 1 if and only if all the bits are 1. Following which, there would be a no conditional check and simply be both an add and a subtract operation but where the value to be added will only be 1 if the mask for ‘p’ matches and 1 to be subtracted if the mask for ‘s’ matches respectively. I’m not fully sure if this would necessarily be faster than the pointer look-up solution, but it would be interested to try this version of the code and see how fast it performs.

Update: The bit arithmetic could also be done with an XOR on the mask, and following which the ‘popcnt’ x86 instruction could be used to figure out if all are 0 bits.

Thank you for your post and reply but I fear with a post + title like this you may just be chumming the waters.
What do you mean by "chumming the waters" in this context?
People who just skim the headline and article will come away convinced that dropping to assembly is the “way to go fast” even if they never actually do it.
Anyone with a passing understanding of Assembly or compilers would find that idea laughable. As for the others, it turns out not knowing what you don’t know can be very expensive.
"Clickbait"
> jumps are a lot slower than conditional arithmetic.

This statement is true if the jumps are unpredictable. If the jumps are predictable, then jumps will be faster.

Linus had a whole rant about this back in the day, arguing that cmov is not useful if branches are predictable: https://yarchive.net/comp/linux/cmov.html

I haven't run any benchmarks, but jump-if-equal and set-if-equal would seem to have the same level of predictability.

My naive, untested intuition is that there's only one meaningful difference: the former has to dump the entire pipeline on a miss, and the latter only has to nop a single instruction on a miss.

But maybe I'm missing something. I'll re-read his rant.

EDIT:

Linus rants a lot, but makes one concrete claim:

    You can always replace it by
    
      j<negated condition> forward
      mov ..., %reg
     forward:
    
    and assuming the branch is AT ALL predictable (and 95+% of all branches
    are), *the branch-over will actually be a LOT better for a CPU.*
So, I decided to test that.

    [18:50:14 user@boxer ~/src/looptest] $ diff -u loop2.s loop4.s
    --- loop2.s 2023-07-06 18:40:11.000000000 -0400
    +++ loop4.s 2023-07-06 18:46:58.000000000 -0400
    @@ -17,11 +17,15 @@
      incq %rdi
      xorl %edx, %edx
      cmpb $115, %cl
    - sete %dl
    + jne _run_switches_jmptgt1
    + mov $1,   %dl
    +_run_switches_jmptgt1:  
      addl %edx, %eax
      xorl %edx, %edx
      cmpb $112, %cl
    - sete %dl
    + jne _run_switches_jmptgt2
    + mov $1,   %dl
    +_run_switches_jmptgt2:  
      subl %edx, %eax
      testb %cl, %cl
      jne LBB0_1
    [18:50:29 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop2.s -o l2
    [18:50:57 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop4.s -o l4
    [18:51:02 user@boxer ~/src/looptest] $ time ./l2 1000 1
    449000
    ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.697 total
    [18:51:09 user@boxer ~/src/looptest] $ time ./l4 1000 1
    449000
    ./l4 1000 1  4.53s user 0.01s system 99% cpu 4.542 total
I feel pretty confident that Linus has made a poor prediction about poor prediction here. Jumps are indeed slower.

To be fair to Linus, since Clang and I are using sete here, not cmov, I also tested cmov, and the difference was insignificant:

    [19:53:12 user@boxer ~/src/looptest] $ time ./l2 1000 1            
    449000
    ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.700 total
    [19:53:15 user@boxer ~/src/looptest] $ time ./l5 1000 1            
    449000
    ./l5 1000 1  0.68s user 0.00s system 99% cpu 0.683 total
Jumps are slower.
> jump-if-equal and set-if-equal would seem to have the same level of predictability.

The difference is that branches have dedicated hardware (branch predictors) that will speculatively execute subsequent instructions based on their best guess about which way the branch will go. Whereas conditional moves cannot execute any subsequent instructions until the correct value is available.

Put another way, CPUs have control flow speculation, but not conditional move speculation. I don't know if conditional move speculation would be a feasible thing to implement or not, but I'm pretty sure that no mainstream CPUs have such a feature.

> Whereas conditional moves cannot execute any subsequent instructions until the correct value is available.

That is incorrect. Super-scalar processors have no problem executing subsequent instructions before the cmov writebacks. However, the register cmov writes to can of course not be read before cmov has has passed the execution unit. But that's not different from other arithmetic instructions.

You are correct, I should have clarified, subsequent instructions that depend on the result of the cmov cannot execute until the cmov has executed. Whereas subsequent instructions that depend on the result of the branch instruction can be speculatively executed even before the branch conditional has been evaluated.
True, but independently of whether "cmov rax, ..." or "jnz L; mov rax, ...; L:" is used, subsequent instructions that reads rax needs to stall until rax has been written to (or at least until cmov/mov has executed if bypasses are used).
I'd be curious to learn why CPUs don't have conditional move speculation.
Because modern CPUs as a rule don't speculate on values to arithmetic, only on control flow, and CMOV acts like arithmetic.

That is, if there is an add instruction on rax and rbx, no matter what, the add instruction will not execute until both rbx and rbx are available. If the result went into rax, and there is an another instruction that uses that as a source, no matter what that instruction will not execute until the add has completed.

CMOV is implemented as an ALU instruction that always writes into it's output, and either writes the value that is already in there (which is why it depends on the value of it's output) or the value provided, depending on flags.

I'm not saying you're wrong — I'm completely ignorant at the microcode level — but it seems to me like between

    cmp x, y
    je z
and

    cmp x, y
    sete z
the actual speculative part is the same: speculating as to the result of cmp x, y

If that's true, why would it not simply pipeline sete and the following instructions and simply execute (or not execute) sete according to its prediction, and then double check itself and reverse (or apply) the operation if the prediction was wrong?

I probably just have a bad mental model of what's going on under the (under the) hood, so whatever patience you have to deal with my stupid questions would be greatly appreciated.

I hope you work on compiler backends.
Speculative execution is all about control flow. It's about what value is in the instruction pointer at some nebulous point in the future.

A conditional jump can put one of two values into the instruction pointer, they will either increment the instruction pointer (jump not taken) or put the immediate value into the instruction pointer. (jump taken)

cmov/sete are utterly deterministic; they always increment the instruction pointer. There's nothing to speculate on, there's nothing to predict. They just go to the next instruction.

> Speculative execution is all about control flow

It's murkier than that. Speculation also deals with the order in which instructions can be executed. Take for example memory ordering (discussed in a mini essay elsewhere here): we typically speculate that all loads are unrelated to any other older in-flight stores with unresolved addresses so that we can optimistically launch them. This is not a control flow issue but it is something we both speculate and predict (memory dependence predictors!) despite the next PC being essentially deterministic.

> Speculative execution is all about control flow. It's about what value is in the instruction pointer at some nebulous point in the future.

.. and all about what we can wheedle out of all the background speculation that will help us get root on this box.

One other perspective is that by speculating the outcomes of conditional instructions, you naturally open yourself up to mispeculating them. This sounds obvious but the consequences for the uarch are quite severe. This is because anytime you mispeculate an instruction, most (all?) contemporary CPUs throw out all younger speculative progress (even if it is unrelated!) and restart at the instruction it originally mispeculated. Throwing out all this work is both i) a waste of power/cycles (you did all this speculative work for nothing!) and ii) quite an expensive operation because you either have to iteratively rollback the state (slow!) or take a snapshot the state on every conditional instruction (expensive from power/area perspective).

A similar idea to what you're proposing (and a possible solution to the above issue) does come up in another part of the processor however! Specifically, high performance processors launch loads very aggressively and often times return data as soon as the address is known. This is because memory is often the bottleneck for performance. This, unfortunately, has some challenges. Namely, memory ordering violations. Take for example the following snippet (ARMv8):

    mov x1, #1    
    udiv x3, x2, x1
    str x2, [x3]
    ldr x4, [x2]
    add x5, x4, x4
This is a silly and somewhat contrived code sequence, but note here that both str x2 and ldr x4 access the same address and thus the value in x4 should be x2. Note, however, that since str x2's address (x3) is produced by a slow division operation but ldr x4's address (x2) is available much more quickly, ldr x4 likely will launch before the CPU even knows that str x2 conflicts with it. Thus, the data returned by the load will be whatever random old stale data is in the cache rather than the correct value that is currently sitting in x2. This means that the subsequent add which consumes this data will produce an incorrect value, leading the whole program to derail. Once the CPU detects this issue, it has to throw away all the state and restart execution of the program at ldr x4 in order to fix its mistake and fix up the memory ordering violation. In essence, the CPU is speculating that str x2 and ldr x4 are unrelated because doing so is very important for performance. Unfortunately, however, memory ordering violations are actually somewhat common and constantly having to restart execution has negative performance implication.

Now, this is actually a very similar problem as we'd see with conditional instruction speculation! So how do we solve this issue for memory ordering violations? Well, we predict which pairs of stores and loads are dependent and block the load from launching until the address of its supposed dependent store resolves. If this predictor is functioning well, we are able to both aggressively launch loads while also avoiding many costly fixups!

So, how would we translate this to conditional instruction speculation? Well, one idea is that we could predict both whether a given instruction is predictable and, if so, which way we should predict it. If a conditional instruction is predicted as unpredictable, its result will not be speculated (thereby avoiding frequent costly restarts) but if it is predicted to be predictable, we can try to predict which one to take.

Would this work? Maybe. Will anyone actually do this? Likely not. As others have suggested, conditional instructions are almost exclusively used for hard to predict conditions specifically because CPUs don't speculate them. Thus, in most existing code the predictor would just say "yep can't predict it" and we'd just have ended up wasting a bunch of area and power on a predictor that never gets used.

If you're really dedicated to this cause though, feel free to write a paper on it. Spitballing performance numbers is easy but often wrong in quite surprising ways, so maybe this might just work for some weird reason I've missed :)

Linus' post is 15+ years old. Much has changed in Intel hardware since then. He was probably right on the money re the hardware available at the time.
> I don't know when the change was made, but conditional moves are fast and efficient on the last several generations of AMD and Intel processors. Usually, you are trading 1 or 2 extra cycles of latency against the chance of a ~15 cycle mispredicted branch penalty. If your branch cannot be predicted correctly ~85% of the time, this can be a significant win.

https://news.ycombinator.com/item?id=10749195

I read the rant. He is talking about Pentium 4.
The inputs here are random which is the problem and why this isn't demonstrating that. Create an input of all 's' and compare it.
Better than random input, but still only ~half as fast as using sete

    [19:13:34 user@boxer ~/src/looptest] $ diff -u bench.c bench-alls.c      
    --- bench.c 2023-07-06 16:04:16.000000000 -0400
    +++ bench-alls.c 2023-07-06 19:13:34.000000000 -0400
    @@ -17,7 +17,7 @@
       int num_rand_calls = number / CHAR_BIT + 1;
       unsigned char *buffer = malloc(num_rand_calls * CHAR_BIT);
       for (int i = 0; i < num_rand_calls; i++) {
    -    buffer[i] = rand();
    +    buffer[i] = 's'; //rand();
       }
       return buffer;
     }
    [19:13:37 user@boxer ~/src/looptest] $ gcc -O3 bench-alls.c loop2.s -o l2
    [19:13:42 user@boxer ~/src/looptest] $ gcc -O3 bench-alls.c loop4.s -o l4
    [19:13:47 user@boxer ~/src/looptest] $ time ./l2 1000 1
    250001000
    ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.699 total
    [19:13:55 user@boxer ~/src/looptest] $ time ./l4 1000 1
    250001000
    ./l4 1000 1  1.28s user 0.00s system 99% cpu 1.290 total
Jumps are slower.
Microbenchmarks are hard. You aren't doing any meaningful work that could benefit from speculatively executing instead of stalling for the conditional value.

Similarly you might be busting the pipeline by chaining together the jumps so close together.

Not saying your point is wrong, just saying your proof isn't super solid.

In this benchmark the only loop carried dependency is over the res variable (edit: and of course the index). The jump doesn't break these dependencies, so for this specific problem, the additional latency of the cmov doesn't matter as it is always perfectly pipelined and cmov will always come up on top. But if the input of cmov depended on a previous value, then potentially a branch could be better given an high enough prediciton rate.
Jumps are slower on completely random input. If I understand Linus’s point correctly, he is suggesting that random inputs like this are unusual (although a good way to measure worst case performance)
Did you test this on a Pentium 4, the processor that Linus is talking about?
Is this the reason I dont usually see any speed up if I eliminate array boundary checking in C#? The jump condition is almost always false, is this what "predictable" means?
Indeed.

The cost of bound checking is second order effects like making vectorization harder, slightly higher instruction (and possibly data) cache pressure, or requiring higher decode bandwidth. For the vast majority of programs these bottlenecks do not really matter.

I mean, if the innermost loop is something like 3 assembly instructions, two extra instructions cmp and jg do not make any difference, if jg never executes?
If they are not in the critical path, it doesn't matter. There is no instruction cache issues as the loop is tiny. Also as the loop is tiny it will fit in the u-op cache (or even in the loop cache), so decoding is not an issue either. The only problem is potential lack of vectorization, but a good vector ISA can in principle handle the bound checking with masked reads and writes (but now the check is no longer a predictable branch, but it might end up in the critical path, although it is not necessarily a big cost, or even measurable, anyway).
I thought I began to understand something, your rant proven me wrong) Thanks, anyway)
I am by no means an expert, but I believe what you have in mind would likely fit in i-cache without a problem, so you wouldn’t see a significant difference.

There is an interesting talk titled ‘the death of optimizing compilers’ that argues that for most code these optimizations are almost completely meaningless, and in the hot loops where it actually matters, they are not good compared to humans (and sometimes 100x or more improvements are possible and left on the table). While I don’t completely agree with its points, it is a good talk/slides to read through.

Its not that compilers are stupid, they just dont know what humans know about their data, it's ranges, invariants, symmetries etc. They work on most general case, which can be horribly inefficient.
What version of GCC are you using? For me both versions perform the same, both on Ubuntu and Windows:

    $ time ./lone 1000 1
        851000

        real    0m3.578s
        user    0m3.574s
        sys     0m0.004s
        
    $ time ./ltwo 1000 1
        851000

        real    0m3.583s
        user    0m3.583s
        sys     0m0.000s

    $ gcc --version
        gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
        Copyright (C) 2019 Free Software Foundation, Inc.
        This is free software; see the source for copying conditions.  There is NO
        warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Sorry, I write 'gcc' purely out of force of habit. I'm using Clang/LLVM.

    [17:23:00 user@boxer ~/looptest] $ uname -a
    Darwin boxer.local 21.6.0 Darwin Kernel Version 21.6.0: Thu Jun  8 23:57:12 PDT 2023; root:xnu-8020.240.18.701.6~1/RELEASE_X86_64 x86_64
    [17:23:47 user@boxer ~/looptest] $ cc -v
    Apple clang version 14.0.0 (clang-1400.0.29.202)
    Target: x86_64-apple-darwin21.6.0
    Thread model: posix
    InstalledDir: /Library/Developer/CommandLineTools/usr/bin
Clang generates the sete instruction for me with the above code:

    [17:23:49 user@boxer ~/looptest] $ gcc -c -O3 loop2.c   
    [17:25:00 user@boxer ~/looptest] $ objdump -d --symbolize-operands --x86-asm-syntax=intel --no-show-raw-insn loop2.o
    
    loop2.o: file format mach-o 64-bit x86-64
    
    Disassembly of section __TEXT,__text:
    
    0000000000000000 <_run_switches>:
           0:       push rbp
           1:       mov rbp, rsp
           4:       xor eax, eax
           6:       nop word ptr cs:[rax + rax]
    <L0>:
          10:       movzx ecx, byte ptr [rdi]
          13:       add rdi, 1
          17:       xor edx, edx
          19:       cmp cl, 115
          1c:       sete dl
          1f:       add eax, edx
          21:       xor edx, edx
          23:       cmp cl, 112
          26:       sete dl
          29:       sub eax, edx
          2b:       test cl, cl
          2d:       jne  <L0>
          2f:       pop rbp
          30:       ret
Is rewriting switch statements to a bunch of ifs always faster? Or is there some number of cases where the switch is faster? Seems like it should be added as a compiler optimization if it's consistent.
It shouldn't be.

If the fastest way to implement a particular `switch` in assembly is with the equivalent of a set of `if`s, a reasonably smart compiler "should" be able to output the assembly to do that. And I thought that gcc and clang at least have actually been smart enough to do that for a while now.

But if the number of `if`s is high and the distribution sufficiently dense, where a jump table is better than a bunch of `if`s, then a `switch` should output that.

OTOH, a sufficiently smart compiler could theoretically turn a bunch of `if`s into a `switch`-like jump table - but it's much harder to reason that case through correctly than it is the other way, so I'm not sure any current compilers are sufficiently smart to do that.

On x64, GCC is supposed to use a a jump table with more than 4 cases when the cases are dense enough (gaps of 9 or less) to minimize wasted memory, otherwise it generates sequential comparisons. Testing on Godbolt, it looks like GCC 13.1 uses 10 cases as the jump table threshold for ARM64.
It shouldn't if the control flow is actually identical.

E.g. note how both the switch- and if-based functions generate the same code using a lookup table here:

https://godbolt.org/z/qoT3M7r5G

Shouldn't the compiler be able to do that, too?
It's not true in general.

In general branching code is faster than branchless code and there's many many places that will demonstrate this with a quick Google. You know how many cycles a correctly predicted branch takes? 0.

On the other hand branchless code has to wait for each calculation to reach a certain stage in the pipeline since the thing to be output is dependent on the result. The CPU will have a whole lot of halts.

So why is this faster? Because the input is literally random(). The branch predictor will be wrong. This isn't normal though. The compiler is creating code that will be faster in most normal use cases.

It's an artificial benchmark that works against the output the compiler produces.

You're saying that humans have information to context that allows them to provide use-case-specific optimizations which a compiler which must anticipate general usage couldn't. That's what profile guided optimizers are for though, right?
I tried https://godbolt.org/, and neither Clang nor GCC trunk give the same code for the two programs.

Pretty shocking for such a simple program.

Yes, there’s always the “sufficiently smart compiler” that can generate this code. Question is, does that compiler exist?
I sure hope so. The semantics are trivially identical, the optimizations should be as well, by default - they should depend on semantics, not syntax. And GCC in another comment under this thread seems to be doing similar or identical optimizations in both cases.

I wholly admit that this implies nothing about all optimizers. But it's a pretty reasonable one to expect.

>does that compiler exist?

and if so are the compile times worth it

These days, I's put my money on zig cc. I.E. zig cc -Os or zig cc -O3
Does the zig compiler have many fancy bits? I was under the impression the c support was a "nothing special" compiler that punted to llvm for optimizations.
Correct, zig currently does not do any of its own optimizations. It won't necessarily have identical results to clang because it'll generate equivalent but not identical IR and the optimizer passes may happen to do something different as a result, but it's not going to be consistently different.
Yes, but this will backfire on ARM, where jumps are as roughly fast as conditional arithmetic.

The whole point of using C is not to think about the underlying architecture. As soon as you start taking "jumps are a lot slower than conditional arithmetic on x86" into account, you're not writing in C, you're writing in assembly with extra steps :-)

Note, that's only ARMv7; ARMv8 dropped most of the conditionally executed instruction stuff. And so it isn't even jumps that's fast on ARMv7, it's specifically cases that can be (and are) converted to predicated instrs; jumps are still gonna be slow in general on anything high-perf enough that it needs speculation, which can include the actual jumps of ARMv7.

If a compiler can convert jumpy code to the predicated instrs, it should be able to trivially convert conditional arith to such too (even easier & more consistently than branches I'd say).

Is that because cjumps on ARM are faster, or cmovs on ARM are slower?
Is there any good article comparing performance across programming languages? Seems like everytime I see one they're broken because the tested logic is poorly implemented in language(s) XYZ.
You can shorten the loop to just tree conditionals:

  while (c = *input++) {
      if (c == 's') r++;
      if (c == 'p') r--;
  }