by pbsd
4706 days ago
I have no idea how to simulate a variable PALIGNR on Intel chips without making the loop extremely slow. On AMD (with XOP), it can be done using VPPERM, which can shuffle from 2 sources. We can do variable alignment like this: vpperm xmm0, xmm1, xmm2, [[0..31] + offset]
On second thought, we can possibly do something similar on Intel using two PSHUFBs and a blend.
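A minimal sketch of the two-PSHUFB idea, written as a scalar C model rather than real SIMD so the byte-level logic is easy to check. Everything named here (`pshufb_model`, `mask_table`, `valignr_model`) is hypothetical and made up for illustration. The assumption is that both shuffle masks are read from one 48-byte sliding table at `offset` and `offset + 16`, and that the "blend" can be a plain OR because PSHUFB zeroes any byte whose index has bit 7 set. With intrinsics, the model calls would correspond to `_mm_shuffle_epi8` and `_mm_or_si128`.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of PSHUFB: an index byte with bit 7 set yields 0,
 * otherwise the low 4 bits select a byte of src. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t idx[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
    memcpy(dst, tmp, 16);
}

/* Hypothetical sliding mask table: 16 "zero this byte" entries,
 * then the identity shuffle 0..15, then 16 more "zero" entries. */
static const uint8_t mask_table[48] = {
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
};

/* Variable-offset PALIGNR model: out[i] = concat[i + offset], where
 * concat is lo (bytes 0..15) followed by hi (bytes 16..31).
 * Valid for offset in 0..16. */
static void valignr_model(uint8_t out[16], const uint8_t hi[16],
                          const uint8_t lo[16], unsigned offset)
{
    uint8_t from_lo[16], from_hi[16];

    /* Mask for lo: i + offset while that is < 16, else 0x80. */
    pshufb_model(from_lo, lo, mask_table + 16 + offset);
    /* Mask for hi: 0x80 while i + offset < 16, else i + offset - 16. */
    pshufb_model(from_hi, hi, mask_table + offset);

    /* Each output byte is nonzero in at most one half, so OR merges them. */
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(from_lo[i] | from_hi[i]);
}
```

In a real loop the two mask reads would be unaligned 16-byte loads, which stay loop-invariant if the offset does not change between iterations.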
|
I tried for a bit, but haven't figured out how to make that work. PSHUFB needs a different XMM operand for each 'rotate'. Loading this operand would take 6 cycles, and I haven't thought of a clever way of generating it in less.
I do greatly appreciate your help, though. Thanks!
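One possible way to build the PSHUFB operand without a table load is to compute it from the offset with byte arithmetic. The sketch below is again a scalar C model with hypothetical names (`pshufb_byte`, `valignr_arith_model`), and it assumes the offset is in 0..16. Adding `0x70` to `i + offset` sets bit 7 exactly when the index runs past the first register, and XOR with `0x80` flips that condition for the second register. With intrinsics this would roughly correspond to `_mm_set1_epi8`, `_mm_add_epi8`, `_mm_xor_si128`, two `_mm_shuffle_epi8` and an `_mm_or_si128`. Whether the broadcast is actually cheaper than the load is not something this sketch measures.

```c
#include <stdint.h>

/* Scalar model of one PSHUFB lane: bit 7 of the index zeroes the byte,
 * otherwise the low 4 bits select from src. */
static uint8_t pshufb_byte(const uint8_t src[16], uint8_t idx)
{
    return (idx & 0x80) ? 0 : src[idx & 0x0F];
}

/* out[i] = concat[i + offset], concat = lo (0..15) then hi (16..31).
 * Valid for offset in 0..16; out must not alias lo or hi. */
static void valignr_arith_model(uint8_t out[16], const uint8_t hi[16],
                                const uint8_t lo[16], unsigned offset)
{
    for (int i = 0; i < 16; i++) {
        /* 0x70..0x7F when i + offset < 16 (selects from lo),
         * 0x80..0x8F otherwise (bit 7 set, byte becomes 0). */
        uint8_t m_lo = (uint8_t)((unsigned)i + offset + 0x70);
        /* Flipping bit 7 inverts the condition for hi; the low 4 bits
         * then equal i + offset - 16 in the range that survives. */
        uint8_t m_hi = (uint8_t)(m_lo ^ 0x80);

        out[i] = (uint8_t)(pshufb_byte(lo, m_lo) | pshufb_byte(hi, m_hi));
    }
}
```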