|
IMHO the original code wasn't written in a way that's particularly friendly to compilers. If you write it like this:

    int run_switches_branchless(const char* s) {
        int result = 0;
        for (; *s; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }

...the compiler will do all the branchless sete/cmov stuff as it sees fit. It will be the same speed as the optimized assembly in the post, give or take something insignificant. However, it won't unroll and vectorize the loop, since the trip count isn't known before the loop starts. If you write it like this:

    int run_switches_vectorized(const char* s, size_t size) {
        int result = 0;
        for (; size--; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }

It will know the size of the loop, and will unroll it and use AVX-512 instructions if they're available. This will be substantially faster than the first loop for large inputs, although I'm too lazy to benchmark just how much faster it is.

Now, this requires knowing the size of your string in advance, and maybe you're the sort of C programmer who doesn't keep track of how big your strings are. I'm not your coworker, I don't review your code. Do what you want. But you really really probably shouldn't.

https://godbolt.org/z/rde51zMd8 |
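
For callers that only have a NUL-terminated string, one possible bridge to the sized loop is to pay for a `strlen` pass first. This is only a sketch: `run_switches_sized` is the sized loop from the comment above under a different name, `run_switches_cstr` is a hypothetical wrapper, and nothing here has been benchmarked, so whether the extra pass over the string is worth it is an open question.

```c
#include <stddef.h>
#include <string.h>

/* Sized loop: the trip count is known before the loop starts. */
int run_switches_sized(const char *s, size_t size) {
    int result = 0;
    for (; size--; ++s) {
        result += *s == 's';  /* comparison yields 0 or 1 */
        result -= *s == 'p';
    }
    return result;
}

/* Hypothetical wrapper: measure the string once, then run the sized loop.
   This walks the string twice (strlen, then the count). */
int run_switches_cstr(const char *s) {
    return run_switches_sized(s, strlen(s));
}
```

Calling `run_switches_cstr("ssp")` counts two `'s'` and one `'p'` and returns 1. If the length is already tracked elsewhere, calling the sized function directly skips the `strlen` pass.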
Your code achieves 3.88 GiB/s.
I intentionally didn't go down the route of vectorizing. I wanted to keep the scope of the problem small, and show off the assembly tips and tricks in the post, but maybe there's potential for a future post, where I pad the input string and vectorize the algorithm :)
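
As a rough illustration of what the padding idea could look like in plain C (not the post author's code, and not measured): if the buffer is padded to a multiple of a fixed block size with bytes that are neither `'s'` nor `'p'`, the inner loop has a constant trip count and needs no tail handling. `BLOCK_SIZE`, `run_switches_padded`, and the choice of 64 bytes are assumptions made for this sketch; whether a given compiler turns it into vector instructions depends on the compiler and flags.

```c
#include <stddef.h>

/* Hypothetical block size: 64 bytes, the width of one AVX-512 register. */
enum { BLOCK_SIZE = 64 };

/* Assumes: buf holds padded_size bytes, padded_size is a multiple of
   BLOCK_SIZE, and the padding bytes are neither 's' nor 'p' (e.g. zeros),
   so they contribute nothing to the result. */
int run_switches_padded(const char *buf, size_t padded_size) {
    int result = 0;
    for (size_t i = 0; i < padded_size; i += BLOCK_SIZE) {
        int block_sum = 0;
        /* Fixed trip count: no per-byte terminator check, no tail loop. */
        for (size_t j = 0; j < BLOCK_SIZE; ++j) {
            block_sum += buf[i + j] == 's';
            block_sum -= buf[i + j] == 'p';
        }
        result += block_sum;
    }
    return result;
}
```

A zero-initialized array such as `char buf[64] = "sssp";` already satisfies the padding assumption, since the remaining bytes are zero.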