Depending on timing requirements, device type, operating speed, and word width, you have to add one or more layers of flip-flops to facilitate timing closure and avoid potential metastability issues.
Right, but that's true of all CPU instructions. If you already have an ALU capable of doing things like integer multiplication, would adding what is essentially a bunch of chained flip-flops really add much more complexity or resource usage?