|
I think I managed to improve on both this post, and its sequel, at the cost of specializing the function for the case of a string made only of 's' and 'p'. The benchmark only tests strings made of 's' and 'p', so I think it is fair. The idea is as follow. We want to increase `res` by one when the next character is `s`. Naively, we might try something like this: res += (c - 'r'); // is `res += 1` when c == 's'
This doesn't work, as `'p' - 'r' == -2`, and we'd need it to be -1.But `'p' - 'r'`, when viewer as an unsigned integer, underflows, setting the carry flag.
Turns out x64 has an instruction (adc) that adds two registers _plus_ the carry flag. Therefore we can replace two `cmp, cmov` with one `sub, adc`: run_switches:
xor eax, eax # res = 0
loop:
movsx ecx, byte ptr [rdi]
test ecx, ecx
je ret
inc rdi
sub ecx, 'r'
adc eax, ecx # Magic happens here
jmp loop
ret:
ret
Benchmarks are as follows (`bench-x64-8` is the asm above): Summary
'01-six-times-faster-than-c/bench-x64-8 1000 1' ran
1.08 ± 0.00 times faster than '02-the-same-speed-as-c/bench-c-4-clang 1000 1'
1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
Of course, one could improve things further using SWAR/SIMD... |
On my machine I'm getting 0.244s for `loop-5.x64.s` and 0.422s for your implementation above.
I'm not sure why exactly we're seeing this discrepancy, and for what it's worth your implementation looks faster to me. I guess this is why you need to always benchmark on the hardware you're going to be running the code on...