by nicula
459 days ago
Some people already mentioned this in the r/cpp discussion. Small correction: 256 is not the correct number of iterations, since if all elements in that slice are even, then your 8-bit counter will wrap around to zero, which can lead to a wrong answer. What you want is 255 iterations. I've looked at the generated assembly for such a solution and it doesn't look great. I'm expecting a significant speed penalty, but I haven't had the time to test it today. Will probably do so tomorrow.
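To make the 255 limit concrete, here is a minimal scalar sketch. It assumes the input is a buffer of `uint8_t` values and that the goal is to count the even ones; the function name `count_even_chunked` and the `std::vector` interface are made up for illustration and are not from the original post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts even bytes using an 8-bit partial counter
// that is flushed into a wide total after at most 255 elements.
std::size_t count_even_chunked(const std::vector<std::uint8_t>& v) {
    constexpr std::size_t kChunk = 255;  // largest count a uint8_t can hold
    std::size_t total = 0;
    for (std::size_t i = 0; i < v.size(); i += kChunk) {
        const std::size_t end = std::min(v.size(), i + kChunk);
        assert(end - i <= kChunk);  // so `partial` below cannot wrap
        std::uint8_t partial = 0;
        for (std::size_t j = i; j < end; ++j) {
            partial += static_cast<std::uint8_t>((v[j] & 1) == 0);
        }
        total += partial;  // widen before the 8-bit counter can overflow
    }
    return total;
}
```

With a chunk size of 256, a slice of 256 even values would leave `partial` at 0 and the total would be off by 256.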
|
This is still lower than optimal because the compiler will reduce to a uint8. Both SSE2 and NEON support reducing to a wider value via _mm_sad_epu8 and vpadal_u8, respectively. This allows for 255 iterations in the inner loop instead of 15 or 7.
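A rough sketch of the SSE2 variant, assuming the task is counting even bytes in a `uint8_t` buffer. The function name `count_even_sad` is hypothetical, and this is hand-written intrinsics code, not what any compiler is claimed to generate. Sixteen 8-bit lane counters run for up to 255 vector iterations, then `_mm_sad_epu8` against zero sums them into two 64-bit values. When `__SSE2__` is not defined, the code falls back to the plain scalar loop.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: counts even bytes in data[0..n).
std::size_t count_even_sad(const std::uint8_t* data, std::size_t n) {
    std::size_t total = 0;
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i one = _mm_set1_epi8(1);
    const __m128i zero = _mm_setzero_si128();
    __m128i wide = _mm_setzero_si128();  // two 64-bit running sums
    while (n - i >= 16) {
        __m128i lanes = _mm_setzero_si128();  // sixteen 8-bit counters
        // At most 255 iterations, so no 8-bit lane can wrap.
        for (int k = 0; k < 255 && n - i >= 16; ++k, i += 16) {
            const __m128i v =
                _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
            // (~v & 1) is 1 in every lane that holds an even byte.
            lanes = _mm_add_epi8(lanes, _mm_andnot_si128(v, one));
        }
        // Sum of absolute differences against zero adds each group of
        // eight lanes into one 64-bit element.
        wide = _mm_add_epi64(wide, _mm_sad_epu8(lanes, zero));
    }
    std::uint64_t halves[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(halves), wide);
    total += static_cast<std::size_t>(halves[0] + halves[1]);
#endif
    // Scalar tail (or the whole input when SSE2 is unavailable).
    for (; i < n; ++i) {
        total += (data[i] & 1) == 0;
    }
    return total;
}
```

The NEON version would have the same shape with `vpadal_u8` doing the widening step; it is left out here to keep the sketch to one instruction set.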