|
Thinking about this a little further, I believe PSHUFB is the way to go, at least when the count is large, because it lets us do two iterations in essentially one go (I haven't tested the code; it's mostly a sketch):
vmovdqa xmm0, [0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6] ; nibble LUT: lut[n] = (n & 3) + (n >> 2)
vmovdqa xmm15, [0x0f, 0x0f, ..., 0x0f]
vmovdqu xmm7, [rdi]
_loop_body:
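vpand xmm8, xmm15, [rdi] ; low nibble of each byte (same 16 bytes as xmm7)
vpsrlw xmm9, xmm7, 4 ; shift each 16-bit word right by 4 ...
vpand xmm9, xmm9, xmm15 ; ... and mask, leaving the high nibble of each byte
vpshufb xmm8, xmm0, xmm8 ; LUT lookup for the low nibbles
vpshufb xmm9, xmm0, xmm9 ; LUT lookup for the high nibbles
vpaddb xmm8, xmm8, xmm9 ; xmm8[i] = per-byte sum, at most 6 + 6 = 12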
vpshufb xmm7, xmm8, xmm8 ; since sum <= 12, we already have the next sum in the vector!
; xmm7[0] = xmm8[xmm8[0]]
vpaddb xmm8, xmm8, xmm7 ; add it
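vpextrb eax, xmm8, 0 ; eax = total advance for both iterations (zero-extended into rax)
vmovdqu xmm7, [rdi + rax] ; load the next 16 bytes
add rdi, rax
sub esi, 2 ; two iterations per pass, so this assumes the count in esi is even and nonzero
jnz _loop_body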
This can likely be extended to 32-byte vectors with AVX2, but I have not thought much about that case. |
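To sanity-check the idea without a SIMD target, here is a minimal scalar model in C of what the loop above appears to compute. It assumes that each byte's advance is the sum of its four 2-bit fields (which is what the nibble LUT yields), and that the buffer has at least 15 bytes of readable padding, since the vector version loads 16 bytes at a time. The names `step_of`, `skip_naive` and `skip_paired` are made up for this illustration, and the odd-count tail is an addition that the assembly sketch does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Advance encoded by one byte: lut[low nibble] + lut[high nibble],
 * i.e. the sum of its four 2-bit fields (assumed meaning of the LUT). */
static unsigned step_of(uint8_t b) {
    static const uint8_t lut[16] = {0, 1, 2, 3, 1, 2, 3, 4,
                                    2, 3, 4, 5, 3, 4, 5, 6};
    return (unsigned)lut[b & 0x0f] + (unsigned)lut[b >> 4];
}

/* Reference: follow the chain one step per iteration. */
static size_t skip_naive(const uint8_t *p, unsigned count) {
    size_t off = 0;
    while (count--)
        off += step_of(p[off]);
    return off;
}

/* Scalar model of the vector loop: two steps per pass.
 * Reads p[off .. off + 15], so the buffer needs padding. */
static size_t skip_paired(const uint8_t *p, unsigned count) {
    size_t off = 0;
    for (; count >= 2; count -= 2) {
        uint8_t s[16];
        for (int i = 0; i < 16; i++)        /* models vpand/vpsrlw/vpshufb/vpaddb */
            s[i] = (uint8_t)step_of(p[off + i]);
        uint8_t first = s[0];
        assert(first <= 12);                /* so it is a valid index into s */
        uint8_t second = s[first];          /* models lane 0 of vpshufb xmm7, xmm8, xmm8 */
        off += (size_t)first + second;
    }
    if (count)                              /* odd leftover, not handled by the asm sketch */
        off += step_of(p[off]);
    return off;
}
```

Comparing `skip_paired` against `skip_naive` on a padded buffer is a cheap way to confirm that the "next sum is already in the vector" observation holds before testing the assembly itself.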