|
|
|
|
|
by orlp
1072 days ago
|
|
I made a variant that is (on my Apple m1 machine) 20x faster than the naive C version in the blog by branchlessly processing the string word-by-word: int run_switches(const char* input) {
int res = 0;
// Align to word boundary.
while ((uintptr_t) input % sizeof(size_t)) {
char c = *input++;
res += c == 's';
res -= c == 'p';
if (c == 0) return res;
}
// Process word-by-word.
const size_t ONES = ((size_t) -1) / 255; // 0x...01010101
const size_t HIGH_BITS = ONES << 7; // 0x...80808080
const size_t SMASK = ONES * (size_t) 's'; // 0x...73737373
const size_t PMASK = ONES * (size_t) 'p'; // 0x...70707070
size_t s_accum = 0;
size_t p_accum = 0;
int iters = 0;
while (1) {
// Load word and check for zero byte.
// (w - ONES) & ~w has the top bit set in each byte where that byte is zero.
size_t w;
memcpy(&w, input, sizeof(size_t));
if ((w - ONES) & ~w & HIGH_BITS) break;
input += sizeof(size_t);
// We reuse the same trick as before, but XORing with SMASK/PMASK first to get
// exactly the high bits set where a byte is 's' or 'p'.
size_t s_high_bits = ((w ^ SMASK) - ONES) & ~(w ^ SMASK) & HIGH_BITS;
size_t p_high_bits = ((w ^ PMASK) - ONES) & ~(w ^ PMASK) & HIGH_BITS;
// Shift down and accumulate.
s_accum += s_high_bits >> 7;
p_accum += p_high_bits >> 7;
if (++iters >= 255 / sizeof(size_t)) {
// To prevent overflow in our byte-wise accumulators we must flush
// them every so often. We use a trick by noting that 2^8 = 1 (mod 255)
// and thus a + 2^8 b + 2^16 c + ... = a + b + c (mod 255).
res += s_accum % 255;
res -= p_accum % 255;
iters = s_accum = p_accum = 0;
}
}
res += s_accum % 255;
res -= p_accum % 255;
// Process tail.
while (1) {
char c = *input++;
res += c == 's';
res -= c == 'p';
if (c == 0) break;
}
return res;
}
Fun fact: the above is still 1.6x slower (on my machine) than the naive two-pass algorithm that gets autovectorized by clang: int run_switches(const char* input) {
size_t len = strlen(input);
int res = 0;
for (size_t i = 0; i < len; ++i) {
char c = input[i];
res += c == 's';
res -= c == 'p';
}
return res;
}
|
|
You can speedup the code by unrolling your inner loop a few times (try 4x or 8x) - it does mean that your overflow prevention limit is lowered (to a multiple of the unrolled grouping number) and run a few more times. But the speedup offsets the increased bookkeeping.
A version I played with showed increased speed by saving the in-progress accumulation in an array and then doing the final accumulation after the main loop is done. But that may be due to the CPU arch/compiler I'm using.