Hacker News new | ask | show | jobs
by nwallin 1073 days ago
Replying to my own post: The off by 1 error was incorrect. It's because I was calling the function wrong. I had been giving it the size of the buffer, not the size of the string.

Also, someone else figured out that we can just use an and instruction instead of cmp. That gives us this version:

    #include <stddef.h>
    #include <stdint.h>

    int run_switches(const char *s, const size_t n) {
      int res = 0;
      uint8_t tmp = 0;
      for (int i = n & 127; i--; ++s)
        tmp += 1 & *s;
      res += tmp;

      for (int i = n >> 7; i--;) {
        tmp = 0;
        for (int j = 128; j--; ++s)
          tmp += 1 & *s;
        res += tmp;
      }

      return 2 * res - n;
    }
This is 111GB/s, up from 4.5GB/s in the blog. I'm going to try really hard to put this problem down now and work on something more productive.
3 comments

ANDs vs cmps seem to be a mixed bag. They are faster on my older Broadwell system (E5-2690V4 / 128GiB RAM) but they are actually consistently slower on my Rome system (AMD EPYC 7B12 / 512GiB RAM). Of course, neither Broadwells nor Romes have AVX512, so likely this is where you're getting the win from.
Fascinating. Thank you for these exchanges, and @414owen for the original posts. This was fun. :-)
I don't understand something. What does n&127 and n>>7 mean here?
127 is 128-1 or 2^7-1, or 1111111b (in binary). It is a faster way to compute n%D when D is known to be a power-of-two.

n>>7 is equal to n/(2^7), and is a faster way to divide with a power-of-two.

The code in question has to process a string of variable length.

But the compiler/CPU can process bytes one at a time or much faster in groups. The code is trying to process as much as possible in groups of 128.

But since the caller can pass in a string which is not a mulitple of 128 chars, the first for-loop (& 127) will figure out how much of the string to process such that the remaining string length is a multiple of 128.

The second for-loop (>> 7) calculates divides by 128 (>> 7) to find out how many multiples of 128 there are to process. The inner for-loop processes 128 chars looking for 's' chars.

Now the for-loop within a for-loop doesn't look any faster than the plain single for-loop, but I'd assume that the heuristics of certain compilers can intuit that it can generate code to operate on multiple chars at the same time (SIMD instructions), since the result of one operation are independent of others.

On a compiler that cannot generate SIMD code, the code won't be much faster, if at all, than the naive straightforward manner.