Like my sibling posted, the crazy CISCy instructions aren’t comparable because in general they were no faster than an equivalent sequence of simpler instructions. That’s not the case for permute; there are no “simpler” instructions that let you build an efficient permute. It’s one of the fundamental building blocks for efficient vector code, which is why it’s shocking that it was added to SSE so late.
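To make the permute point concrete, here is a rough scalar model of a 16-byte permute in C. The function name `permute_bytes16` is made up for illustration, and the semantics (each index byte picks one source byte by its low 4 bits) are a simplified assumption rather than an exact description of any particular instruction.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 16-byte permute (illustration only).
 * Each output byte is chosen by the low 4 bits of its index byte.
 * Simplified semantics; not an exact model of any real instruction. */
void permute_bytes16(uint8_t dst[16], const uint8_t src[16],
                     const uint8_t idx[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++) {
        tmp[i] = src[idx[i] & 0x0F];   /* data-dependent byte select */
    }
    for (int i = 0; i < 16; i++) {
        dst[i] = tmp[i];               /* copy out, so dst may alias src */
    }
}
```

Without a hardware permute you are stuck with something like this loop: sixteen dependent loads and stores for what a vector unit can do in a single operation.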
The point is that bit twiddling can be much more efficient to implement in hardware, because a fixed rearrangement of bits is just a matter of routing wires. The RBIT instruction in the article significantly speeds up an operation at very low hardware cost.
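For comparison, this is roughly what a 32-bit bit reversal looks like in portable C without such an instruction. It's the usual swap-adjacent-groups trick; the name `reverse_bits32` is just a label I picked for the sketch.

```c
#include <stdint.h>

/* Software bit reversal for a 32-bit word (illustrative sketch).
 * Swaps adjacent bits, then pairs, nibbles, bytes, and half-words. */
uint32_t reverse_bits32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  /* single bits */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  /* bit pairs   */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  /* nibbles     */
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  /* bytes       */
    x = (x >> 16) | (x << 16);                                /* half-words  */
    return x;
}
```

That is over a dozen shifts, masks, and ORs to compute something that in hardware is only a crossing of 32 wires, with no logic gates needed for the permutation itself.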
Polynomial evaluation does not fit into this pattern, because you need actual arithmetic operations to do it, and so a hardware polynomial evaluation instruction has no significant benefit over the corresponding sequence of explicit multiplications and additions.
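As a sketch of that "corresponding sequence", here is Horner's method in C. The function name `horner` and the lowest-degree-first coefficient order are my own choices for the example.

```c
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) with Horner's method.
 * Coefficients are stored lowest degree first (an assumption of this sketch).
 * Returns 0.0 for an empty coefficient array. */
double horner(const double *c, size_t n, double x)
{
    double acc = 0.0;
    for (size_t i = n; i > 0; i--) {
        acc = acc * x + c[i - 1];   /* one multiply and one add per coefficient */
    }
    return acc;
}
```

Every step is a real multiply and a real add, so a dedicated instruction would still have to do the same arithmetic internally. There is no wiring trick that makes it cheaper.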