|
|
|
|
|
by ack_complete
460 days ago
|
|
255 is not ideal either because it results in a partial vector at the end of either 15 or 31 elements. 256-V where V is the vector size is better, so 240 for SSE2/NEON or 224 for AVX2. This is still lower than optimal because the compiler will reduce to a uint8. Both SSE2 and NEON support reducing to a wider value by _mm_sad_epu8 and vpadal_u8, respectively. This allows for 255 iterations in the inner loop instead of 15 or 7. |
|
I wrote the code that you suggested (LMK if I understood your points): https://godbolt.org/z/jW4o3cnh3
And here's the benchmark output, on my machine: https://0x0.st/8SsG.txt (v1 is std::count_if(), v2 is the optimization from my blog post, and v3 is what you suggested).
v2 is faster, but v3 is still quite fast.