| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by orlp 1072 days ago

I made a variant that is (on my Apple m1 machine) 20x faster than the naive C version in the blog by branchlessly processing the string word-by-word:

    int run_switches(const char* input) {
        int res = 0;

        // Align to word boundary.
        while ((uintptr_t) input % sizeof(size_t)) {
            char c = *input++;
            res += c == 's';
            res -= c == 'p';
            if (c == 0) return res;
        }

        // Process word-by-word.
        const size_t ONES = ((size_t) -1) / 255;  // 0x...01010101
        const size_t HIGH_BITS = ONES << 7;       // 0x...80808080
        const size_t SMASK = ONES * (size_t) 's'; // 0x...73737373
        const size_t PMASK = ONES * (size_t) 'p'; // 0x...70707070
        size_t s_accum = 0;
        size_t p_accum = 0;
        int iters = 0;
        while (1) {
            // Load word and check for zero byte.
            // (w - ONES) & ~w has the top bit set in each byte where that byte is zero.
            size_t w;
            memcpy(&w, input, sizeof(size_t));
            if ((w - ONES) & ~w & HIGH_BITS) break;
            input += sizeof(size_t);

            // We reuse the same trick as before, but XORing with SMASK/PMASK first to get
            // exactly the high bits set where a byte is 's' or 'p'.
            size_t s_high_bits = ((w ^ SMASK) - ONES) & ~(w ^ SMASK) & HIGH_BITS;
            size_t p_high_bits = ((w ^ PMASK) - ONES) & ~(w ^ PMASK) & HIGH_BITS;

            // Shift down and accumulate.
            s_accum += s_high_bits >> 7;
            p_accum += p_high_bits >> 7;
            if (++iters >= 255 / sizeof(size_t)) {
                // To prevent overflow in our byte-wise accumulators we must flush
                // them every so often. We use a trick by noting that 2^8 = 1 (mod 255)
                // and thus a + 2^8 b + 2^16 c + ... = a + b + c  (mod 255).
                res += s_accum % 255;
                res -= p_accum % 255;
                iters = s_accum = p_accum = 0;
            }
        }
        res += s_accum % 255;
        res -= p_accum % 255;

        // Process tail.
        while (1) {
            char c = *input++;
            res += c == 's';
            res -= c == 'p';
            if (c == 0) break;
        }

        return res;
    }

Fun fact: the above is still 1.6x slower (on my machine) than the naive two-pass algorithm that gets autovectorized by clang:

    int run_switches(const char* input) {
        size_t len = strlen(input);
        int res = 0;
        for (size_t i = 0; i < len; ++i) {
            char c = input[i];
            res += c == 's';
            res -= c == 'p';
        }
        return res;
    }

3 comments

fuber2018 1072 days ago

I assume the M1's SIMD registers are wider/more numerous than just the couple of size_t registers used for the loading/masking/accumulating inner loop in your run_swtches().

You can speedup the code by unrolling your inner loop a few times (try 4x or 8x) - it does mean that your overflow prevention limit is lowered (to a multiple of the unrolled grouping number) and run a few more times. But the speedup offsets the increased bookkeeping.

A version I played with showed increased speed by saving the in-progress accumulation in an array and then doing the final accumulation after the main loop is done. But that may be due to the CPU arch/compiler I'm using.

link

fuber2018 1072 days ago

If this code only runs on one compiler version/CPU arch, then ASSUMING the compiler will do the RIGHT THING and auto-vectorize the code is okay.

But if your code will be cross-platform/run on different OSes/CPU arch's, then a SWAR version may be more consistently performant - no need to guess if the compiler's optimization heuristics decided to go with the general purpose CPU registers or faster SIMD registers.

Downside is that the devs are exposed to the gnarly optimized code.

link

fuber2018 1072 days ago

Almost the same as my SWAR version - which is what you're doing.

But aren't you reading off the end of the buffer in your memcpy(&w...)? Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?

I just passed in the string length since the caller had that info, otherwise you'd scan the whole string again looking for the zero terminator.

link

orlp 1072 days ago

> But aren't you reading off the end of the buffer in your memcpy(&w...)?

If we go by the absolute strictest interpretation of the C standard my above implementation is UB.

But in practice, if p is word-aligned and is at least valid for 1 byte, then you will not pagefault for reading a whole word. In fact, this is how GCC/musl implement strlen itself.

> Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?

Then the start address is valid (it must contain the null byte), and aligned to a word boundary, in which case I assume it is ok to also read a whole word there.

link

gpderetta 1072 days ago

If I read it correctly, your implementation might read beyond the end of the buffer, and if it crosses a page boundary into an unmapped page, it will segfault. That's one of the many evils of null terminated strings.

link

orlp 1072 days ago

If we go by the absolute strictest interpretation of the C standard, yes, the above is UB.

But in practice no one has page boundaries that cross word boundaries, and I align to a word boundary before doing the word-by-word loop.

link

gpderetta 1072 days ago

good point of course!

link