| HN Mirror


by nwallin 1073 days ago

IMHO the original code wasn't written in a way that's particularly friendly to compilers. If you write it like this:

    int run_switches_branchless(const char* s) {
        int result = 0;
        for (; *s; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }

...the compiler will do all the branchless sete/cmov stuff as it sees fit. It will be the same speed as the optimized assembly in the post, +/- something insignificant. However it won't unroll and vectorize the loop. If you write it like this:

    int run_switches_vectorized(const char* s, size_t size) {
        int result = 0;
        for (; size--; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }

It will know the size of the loop, and will unroll it and use AVX-512 instructions if they're available. This will be substantially faster than the first loop for large inputs, although I'm too lazy to benchmark just how much faster it is.
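As a rough sanity check on the two versions above, the sketch below puts them side by side with a small helper that asserts they agree. `agree_on` is a hypothetical name, not something from the thread. It says nothing about speed, only that passing the length doesn't change the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* NUL-terminated variant: the trip count depends on the data being read. */
int run_switches_branchless(const char *s) {
    int result = 0;
    for (; *s; ++s) {
        result += *s == 's';
        result -= *s == 'p';
    }
    return result;
}

/* Length-passing variant: the trip count is known before the loop starts. */
int run_switches_vectorized(const char *s, size_t size) {
    int result = 0;
    for (; size--; ++s) {
        result += *s == 's';
        result -= *s == 'p';
    }
    return result;
}

/* Hypothetical helper: checks both variants agree, returns the shared result. */
int agree_on(const char *s) {
    int a = run_switches_branchless(s);
    int b = run_switches_vectorized(s, strlen(s));
    assert(a == b);
    return a;
}
```

Calling `agree_on` on a few short strings is enough to confirm that the only thing that changed between the two is how the loop learns where to stop.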

Now, this requires knowing the size of your string in advance, and maybe you're the sort of C programmer who doesn't keep track of how big your strings are. I'm not your coworker, I don't review your code. Do what you want. But you really really probably shouldn't.

https://godbolt.org/z/rde51zMd8


414owen 1073 days ago

The version that's friendly to the compiler is described in part two: https://owen.cafe/posts/the-same-speed-as-c/

It achieves 3.88GiB/s

I intentionally didn't go down the route of vectorizing. I wanted to keep the scope of the problem small, and show off the assembly tips and tricks in the post, but maybe there's potential for a future post, where I pad the input string and vectorize the algorithm :)

nwallin 1073 days ago

So I downloaded your code. On my desktop, with loop-9 gcc I got ~4.5GB/s, and with loop-7 I got ~4.4GB/s. With the following code:

    #include <stddef.h>
    
    int run_switches(const char *s, size_t n) {
      int res = 0;
      for (; n--; ++s)
        res += (*s == 's') - (*s == 'p');
      return res;
    }

I got ~31GB/s in GCC and ~33GB/s in Clang. This is without any padding, or SIMD intrinsics, or any such nonsense. This is just untying the compiler's hands and giving it permission to do its job properly.

Don't want to pass the string length? That's fine, we can figure that out for ourselves. This code:

    #include <stddef.h>
    #include <string.h>

    int run_switches(const char *s) {
      int res = 0;
      for (size_t n = strlen(s); n--; ++s)
        res += (*s == 's') - (*s == 'p');

      return res;
    }

Is 27GB/s. With a little bit of blocking:

    #include <stddef.h>
    
    int run_switches(const char *s, size_t n) {
      int res = 0;
      char tmp = 0;
      for (size_t i = n & 63; i--; ++s)
        tmp += (*s == 's') - (*s == 'p');
      res += tmp;
    
      for (n >>= 6; n--;) {
        tmp = 0;
        for (size_t i = 64; i--; ++s)
          tmp += (*s == 's') - (*s == 'p');
        res += tmp;
      }
    
      return res;
    }

That's ~55GB/s.
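One portability caveat with the blocked version above: whether plain `char` is signed is implementation-defined, and if it is unsigned, a block whose net count is negative would be added to `res` as a large positive number. A minimal sketch that spells the accumulator as `int8_t` instead (the name `run_switches_blocked` is made up here). Each 64-byte block contributes between -64 and 64, which fits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

int run_switches_blocked(const char *s, size_t n) {
    assert(s != NULL || n == 0);
    int res = 0;

    /* Leftover n % 64 bytes first; the net count is within [-63, 63]. */
    int8_t tmp = 0;
    for (size_t i = n & 63; i--; ++s)
        tmp += (*s == 's') - (*s == 'p');
    res += tmp;

    /* Then n / 64 full blocks; each net count is within [-64, 64]. */
    for (size_t blocks = n >> 6; blocks--;) {
        tmp = 0;
        for (size_t i = 64; i--; ++s)
            tmp += (*s == 's') - (*s == 'p');
        res += tmp;
    }
    return res;
}
```

This is only about correctness across platforms. Whether it changes the generated code or the throughput numbers quoted in the thread is not something this sketch measures.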

Anyway, the point is, you're pretty far from the point where you ought to give up on C and dive into assembly.

utopcell 1073 days ago

Indeed. I suppose the two lessons are, stick with C, and don't forget the semantics of your original problem when optimizing.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    
    int run_switches(const char *s) {
      int res = 0;
      uint8_t tmp = 0;
      size_t n = strlen(s);
      for (size_t i = n & 127; i--; ++s)
        tmp += (*s == 's');
      res += tmp;
    
      for (size_t j = n >> 7; j--;) {
        tmp = 0;
        for (size_t i = 128; i--; ++s)
          tmp += (*s == 's');
        res += tmp;
      }
    
      return 2 * res - n;
    }

nwallin 1073 days ago

Neat! Although you'll need to make a copy of `n`. The second loop will reduce the value of n to zero.

Edit: Also, there's an off-by-one error. It should be:

    #include <stddef.h>
    #include <stdint.h>
    
    int run_switches(const char *s, const size_t n) {
      int res = 0;
      uint8_t tmp = 0;
      for (int i = n & 127; i--; ++s)
        tmp += *s == 's';
      res += tmp;
    
      for (int size = n >> 7; size--;) {
        tmp = 0;
        for (int i = 128; i--; ++s)
          tmp += *s == 's';
        res += tmp;
      }
    
      return 2 * res - n + 1;
    }

~90GB/s on my machine, compared to 4.5GB/s for his best effort on his blog. So 20x as fast.

repsilat 1073 days ago

This is a wonderful thread.

Which tricks in there are worth playing around with more widely?

Is the uint8_t just "no point in using something bigger" or does it likely help the compiler? Does/can the signedness matter as well as the size?

Ditto looping downwards -- is it often likely to improve things? Can it generalize to pointer/iterator ranges, or is it often worth trying to phrase them in terms of array/index accesses instead?

I guess the compiler's unrolling heuristics generally aren't as good as that blocking "mod then div" alternative to Duff's device? Obviously taking `s` out of the loop condition is part of the magic.

Not checking the 'p' character by comparison is an easy optimization to understand.

Any places to read about this sort of thing, or any tricks or guidelines that come to mind? I write a fair bit of performance-sensitive code but it's all probably 20x slower than it could be because I have no intuition for what transformations compilers will do beyond "this prob gets inlined" etc.

xoranth 1072 days ago

> I guess the compiler's unrolling heuristics generally aren't as good as that blocking "mod then div" alternative to Duff's device? Obviously taking `s` out of the loop condition is part of the magic.

The magic in this case is the compiler autovectorizer. Making the length of the loop a loop invariant allows the autovectorizer to kick in.

The reason "blocking" by accumulating on uint8_t helps further is that it allows the compiler to accumulate on 8-bit SIMD lanes instead of 32-bit SIMD lanes. The same operation on 8-bit SIMD lanes will, to a first approximation, do 4x the work per cycle.
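The flip side of accumulating in 8-bit lanes is that a `uint8_t` wraps at 256, which is why the loops above flush into the `int` total every 128 bytes. A small sketch of the difference, with hypothetical function names. It only illustrates the wraparound and makes no claim about what any compiler emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Counts 's' in a single uint8_t with no flushing: wraps modulo 256. */
int count_s_unblocked(const char *s, size_t n) {
    assert(s != NULL || n == 0);
    uint8_t tmp = 0;
    for (size_t i = 0; i < n; ++i)
        tmp += s[i] == 's';
    return tmp;
}

/* Same count, but the 8-bit accumulator is flushed into an int every
   128 bytes, so it never holds more than 128. */
int count_s_blocked(const char *s, size_t n) {
    assert(s != NULL || n == 0);
    int res = 0;
    uint8_t tmp = 0;
    for (size_t i = n & 127; i--; ++s)
        tmp += *s == 's';
    res += tmp;
    for (size_t blocks = n >> 7; blocks--;) {
        tmp = 0;
        for (size_t i = 128; i--; ++s)
            tmp += *s == 's';
        res += tmp;
    }
    return res;
}

/* Usage sketch: fills a buffer with n copies of 's' (n <= 1024) and
   returns the count from the chosen variant. */
int count_on_all_s(size_t n, int blocked) {
    char buf[1024];
    assert(n <= sizeof buf);
    memset(buf, 's', n);
    return blocked ? count_s_blocked(buf, n) : count_s_unblocked(buf, n);
}
```

With 300 's' characters the unblocked count comes back as 300 % 256 = 44, while the blocked one returns 300.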

zokier 1072 days ago

> Is the uint8_t just "no point in using something bigger" or does it likely help the compiler? Does/can the signedness matter as well as the size?

In a good world you could just use uint_fast8_t and the compiler would optimize this question for you. In the real world I don't think compilers are smart enough, or there are too many other constraints limiting them :(

nwallin 1072 days ago

Replying to my own post: The off-by-one "fix" was wrong. The problem was that I was calling the function incorrectly: I had been giving it the size of the buffer, not the size of the string.

Also, someone else figured out that we can just use an AND instruction instead of a cmp. That gives us this version:

    #include <stddef.h>
    #include <stdint.h>

    int run_switches(const char *s, const size_t n) {
      int res = 0;
      uint8_t tmp = 0;
      for (int i = n & 127; i--; ++s)
        tmp += 1 & *s;
      res += tmp;

      for (int i = n >> 7; i--;) {
        tmp = 0;
        for (int j = 128; j--; ++s)
          tmp += 1 & *s;
        res += tmp;
      }

      return 2 * res - n;
    }

This is 111GB/s, up from 4.5GB/s in the blog. I'm going to try really hard to put this problem down now and work on something more productive.
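The `1 & *s` trick leans on the ASCII codes: 's' is 0x73 (low bit set) and 'p' is 0x70 (low bit clear), so for input made only of those two characters the low bit is the same 0/1 value as `*s == 's'`. A hedged sketch, with hypothetical names and assuming an ASCII-compatible character set, that also shows where the equivalence stops holding:

```c
#include <assert.h>

/* 1 for 's', 0 otherwise: the comparison used by the earlier versions. */
int is_s_cmp(char c) { return c == 's'; }

/* Low bit of the character code: matches is_s_cmp for 's' and 'p',
   but not in general (any odd code yields 1, any even code yields 0). */
int is_s_and(char c) { return 1 & c; }

/* Sanity check of the claim for the two characters the benchmark uses. */
void check_s_and_p(void) {
    assert(is_s_and('s') == is_s_cmp('s'));
    assert(is_s_and('p') == is_s_cmp('p'));
}
```

For example, 'a' has an odd code, so `is_s_and('a')` is 1 even though `is_s_cmp('a')` is 0. The AND version is only valid under the s/p-only assumption.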

utopcell 1072 days ago

ANDs vs cmps seem to be a mixed bag. They are faster on my older Broadwell system (E5-2690V4 / 128GiB RAM) but they are actually consistently slower on my Rome system (AMD EPYC 7B12 / 512GiB RAM). Of course, neither Broadwells nor Romes have AVX512, so likely this is where you're getting the win from.

utopcell 1072 days ago

Fascinating. Thank you for these exchanges, and @414owen for the original posts. This was fun. :-)

redf1sh 1072 days ago

I don't understand something. What does n&127 and n>>7 mean here?
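For what it's worth, for an unsigned `n` these are just the remainder and quotient of dividing by 128, since 128 is 2^7: `n & 127` keeps the low seven bits (the leftover bytes handled by the first loop) and `n >> 7` drops them (the number of full 128-byte blocks handled by the second loop). A small sketch, with made-up helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes left over after carving n into 128-byte blocks: same as n % 128. */
size_t leftover_bytes(size_t n) {
    size_t r = n & 127;
    assert(r == n % 128);
    return r;
}

/* Number of full 128-byte blocks in n: same as n / 128. */
size_t full_blocks(size_t n) {
    size_t q = n >> 7;
    assert(q == n / 128);
    return q;
}
```

So a 300-byte string is processed as 44 leftover bytes followed by 2 blocks of 128, and 2 * 128 + 44 gets you back to 300.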

skavi 1073 days ago

Am I missing something, or does this not really account for alignment? Is the compiler doing smarter loop splitting?

nwallin 1072 days ago

You're correct, it does not account for alignment.

The reason it helps performance is because it allows the compiler to accumulate in byte sized SIMD variables instead of int sized SIMD variables. My system has AVX-512 so 64 byte wide SIMD registers. With the non-blocked version, the compiler will load 16 chars into ints in a 64 byte ZMM register, then check if it's an 's', and then increment if so. With the blocked version, with the uint8_t tmp variable, the compiler will load 64 chars into uint8_ts in a 64 byte ZMM register instead. But there's a problem; we're gonna overflow the variables. So the compiler will stop every 128 iterations, and then move the 64 byte uint8_t accumulation variable into four 64 byte int accumulation registers and sum them all up. Then do the next 128 iterations.

I'm pretty sure a similar thing will happen with SSE or AVX2 but I didn't check.

Tuna-Fish 1072 days ago

I think it's just reading unaligned. That's just a ~2x loss of throughput from L1, but the second the problem is large enough that the work being done doesn't reliably fit into the L1, that doesn't matter a bit anymore.

In general for x86, unaligned writes are worth doing work to avoid, but reads are in most situations not really an issue.

utopcell 1073 days ago

Bummer! Edited the answer. Not sure about the off-by-one though. Say the string is str[] = "spp\0". n = strlen(str) is 3. In the end, res would be 1 and 2 * res - n == -1.

nwallin 1073 days ago

Oh. Found it. It's because I wasn't using strlen and had been passing over the length of the buffer instead of the length of the string. Only my code had the off by 1.

smarnach 1072 days ago

This makes the assumption that the only characters in the string are "s" and "p". There is no basis for this assumption. I think this code solves a different problem rather than being an optimisation of the original code.

utopcell 1072 days ago

The string can only contain 's' or 'p' if you examine how it is constructed in bench.c, and taking that into consideration yields another ~2x speedup.

teo_zero 1073 days ago

But this is not the original problem! Only p's should decrease the counter, in your code every non-s does.

utopcell 1072 days ago

The original problem was working on strings that only hold 's' and 'p' characters, as seen in bench.c. The first implementation checked against 's' and 'p' specifically, and all subsequent versions optimized that first version.
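The shortcut being debated follows from simple counting: if every byte is either 's' or 'p', then #s + #p = n, so #s - #p = 2 * #s - n. A minimal sketch comparing the general and specialized forms (function names are made up for illustration). They agree on s/p-only input and diverge as soon as anything else appears.

```c
#include <assert.h>
#include <stddef.h>

/* General form: +1 per 's', -1 per 'p', every other byte ignored. */
int switches_general(const char *s, size_t n) {
    assert(s != NULL || n == 0);
    int res = 0;
    for (size_t i = 0; i < n; ++i)
        res += (s[i] == 's') - (s[i] == 'p');
    return res;
}

/* Specialized form: counts only 's' and assumes every other byte is 'p'. */
int switches_sp_only(const char *s, size_t n) {
    assert(s != NULL || n == 0);
    int count_s = 0;
    for (size_t i = 0; i < n; ++i)
        count_s += s[i] == 's';
    return 2 * count_s - (int)n;
}
```

On "spp" both return -1. On "spxs" the general form returns 1 while the specialized form returns 0, because the 'x' is silently treated as a 'p'.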

magicalhippo 1072 days ago

Another good reason to write optimization-friendly C (or similar) over assembly code, especially in libraries, is that the compiler will evolve with CPUs, while the assembly code will not.

I've seen plenty of cases where replacing hand-written assembly with C (or similar) led to a substantial performance increase because the assembly code was written for some old CPU and not the best way of doing things on current CPUs.

rajnathani 1071 days ago

This seems like the most efficient solution. I have a neighboring comment on this post which suggests using bit arithmetic, but the above solution is more efficient than that. Here’s what the assembly code for the body of the first loop compiles down to (I had to use ChatGPT-4 as godbolt unfortunately doesn’t work on mobile):

    cmp dl, 's'    ; Compare the character with 's'
    sete dl        ; If the character is 's', set dl to 1. Otherwise, set dl to 0.
    sub al, dl     ; Subtract the result from res

    cmp dl, 'p'    ; Compare the character with 'p'
    sete dl        ; If the character is 'p', set dl to 1. Otherwise, set dl to 0.
    add al, dl     ; Add the result to res

SleepyMyroslav 1072 days ago

>Anyway, the point is, you're pretty far from the point where you ought to give up on C and dive into assembly.

Thank you. I hope people who post random assembly listings on HN written in some extinct ISA will read your posts.

shusaku 1073 days ago

You forgot an important line of the code:

/* DON’T REFACTOR THIS FOR READABILITY IT WILL SLOW DOWN */

elcritch 1073 days ago

Nice! I tried it in Nim and it appears to trigger the same vectorization with:

    {.overflowChecks:off.}
    proc run_switches*(input: cstring): int {.exportc.} =
      result = 0
      for c in input:
        result.inc int('s' == c)
        result.dec int('p' == c)

That gives a ~5x speedup on an Apple M1. Keeping overflow checks on only gets it up to ~2x the default C version. Always nice to know good ways to trigger SIMD opts.

jonny_eh 1073 days ago

> But you really really probably shouldn't.

Shouldn't "not" keep track of string length?

nwallin 1073 days ago

Err... yes. You shouldn't not keep track of string/buffer sizes.