Interestingly, while x86-64 does not seem to have a single opcode for reversing the bits in a byte, it does have an instruction that arbitrarily shuffles the 16 bytes of a 128-bit SSE register [PSHUFB]. It just blows my mind how much data those SIMD instructions process or move around in relatively few clock cycles.
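To make the shuffle concrete, here is a rough scalar model of what PSHUFB does to one register, plus the well-known trick of reversing the bits in every byte with two nibble-table lookups. This is only a sketch in portable C, not real intrinsics: `pshufb_model` and `reverse_bits_16` are made-up names, and on actual hardware each call to the model would be a single instruction.

```c
#include <stdint.h>

/* Scalar model of PSHUFB on one 128-bit register (16 byte lanes).
   Hypothetical helper, not an intrinsic. For each lane i:
   if bit 7 of ctrl[i] is set the result byte is 0,
   otherwise it is src[ctrl[i] & 0x0F]. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t ctrl[16])
{
    uint8_t tmp[16]; /* temporary so dst may alias src or ctrl */
    for (int i = 0; i < 16; i++)
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Reverse the bits inside each of 16 bytes. The shuffle is used as a
   16-entry lookup table: the "source register" holds the table and the
   "control register" holds the 4-bit indices. */
static void reverse_bits_16(uint8_t out[16], const uint8_t in[16])
{
    /* rev4[n] is the 4-bit value n with its bit order reversed */
    static const uint8_t rev4[16] = {
        0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
        0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
    };
    uint8_t lo[16], hi[16], rlo[16], rhi[16];

    for (int i = 0; i < 16; i++) {
        lo[i] = in[i] & 0x0F; /* low nibble of each byte  */
        hi[i] = in[i] >> 4;   /* high nibble of each byte */
    }
    pshufb_model(rlo, rev4, lo); /* reversed low nibbles  */
    pshufb_model(rhi, rev4, hi); /* reversed high nibbles */

    /* The reversed low nibble becomes the new high nibble and vice versa. */
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)((rlo[i] << 4) | rhi[i]);
}
```

With a real vector register the masking, the two lookups and the final OR each cover all 16 bytes at once, which is why the lack of a dedicated byte-reverse opcode hurts less than it first appears.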
It’s actually shocking how long it took Intel to add PSHUFB to SSE (it only arrived with SSSE3). AltiVec (PPC) had the even more powerful vperm, an arbitrary shuffle that picks each of its 16 output bytes from 32 input bytes, way back in 1999.
Like my sibling posted, the crazy CISCy instructions aren’t comparable because in general they were no faster than an equivalent sequence of simpler instructions. That’s not the case for permute; there are no “simpler” instructions that let you build an efficient permute. It’s one of the fundamental building blocks for efficient vector code, which is why it’s shocking that it was added to SSE so late.
The point is that bit twiddling can be much more efficient to implement in hardware because all you're doing is placing wires somewhere. The RBIT instruction in the article significantly speeds up an operation at very low hardware cost.
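For a sense of what that single instruction replaces, here is a minimal sketch of the usual software fallback for reversing a 32-bit word: five swap stages, each made of shifts, masks and an OR. The name `reverse_bits32` is just illustrative. In hardware the same result is a fixed rewiring of 32 signal lines.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word in software.
   Each stage swaps neighbouring groups of bits, doubling the group size. */
static uint32_t reverse_bits32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1); /* swap single bits */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2); /* swap bit pairs   */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4); /* swap nibbles     */
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8); /* swap bytes       */
    x = (x >> 16) | (x << 16);                               /* swap half-words  */
    return x;
}
```

That is well over a dozen ALU operations with a serial dependency between stages, versus one instruction that needs no arithmetic logic at all.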
Polynomial evaluation does not fit into this pattern, because you need actual arithmetic operations to do it, and so a hardware polynomial evaluation instruction has no significant benefit over the corresponding sequence of explicit multiplications and additions.
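The explicit sequence in question is typically Horner's rule, sketched below with an assumed coefficient layout (constant term first) and an illustrative function name. Every step is a real multiply and a real add, so a dedicated instruction would still have to perform the same arithmetic internally.

```c
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) with Horner's rule.
   Assumed layout: coeffs[0] is the constant term.
   Returns 0.0 for an empty coefficient array. */
static double horner(const double *coeffs, size_t n, double x)
{
    double acc = 0.0;
    /* Walk from the leading coefficient down to the constant term:
       one multiply and one add per coefficient. */
    for (size_t i = n; i-- > 0; )
        acc = acc * x + coeffs[i];
    return acc;
}
```

For example, with coefficients {1, 2, 3} and x = 2 this computes 1 + 2*2 + 3*4 = 17.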