|
|
|
|
|
by nwallin
1073 days ago
|
|
Neat! Although you'll need to make a copy of `n`. The second loop will reduce the value of n to null. Edit: Also, there's an off by one error. should be: #include <stddef.h>
#include <stdint.h>
int run_switches(const char *s, const size_t n) {
int res = 0;
uint8_t tmp = 0;
for (int i = n & 127; i--; ++s)
tmp += *s == 's';
res += tmp;
for (int size = n >> 7; size--;) {
tmp = 0;
for (int i = 128; i--; ++s)
tmp += *s == 's';
res += tmp;
}
return 2 * res - n + 1;
}
~90GB/s on my machine, compared to 4.5GB/s for his best effort on his blog. So 20x as fast. |
|
Which tricks in there are worth playing around with more widely?
Is the uint8_t just "no point in using something bigger" or does it likely help the compiler? Does/can the signedness matter as well as the size?
Ditto looping downwards -- is it often likely to improve things? Can it generalize to pointer/iterator ranges, or is it often worth trying to phrase them in terms of array/index accesses instead?
I guess the compiler's unrolling heuristics generally aren't as good as that blocking "mod then div" alternative to Duff's device? Obviously taking `s` out of the loop condition is part of the magic.
Not checking the 'p' character by comparison is an easy optimization to understand.
Any places to read about this sort of thing, or any tricks or guidelines that come to mind? I write a fair bit of performance-sensitive code but it's all probably 20x slower than it could be because I have no intuition for what transformations compilers will do beyond "this prob gets inlined" etc.