Hacker News new | ask | show | jobs
by RaisingSpear 602 days ago
I suspect Intel uses 32x32b multipliers instead of his theorised 16x16b, just that it only has one every second lane. It lines up more closely with VPMULLQ, and it seems odd that PMULUDQ would be one uOp vs PMULLD's two.

PMULLD is probably just doing 2x PMULUDQ and discarding the high bits.

(I tried commenting on his blog but it's awaiting moderation - I don't know if that's ever checked, or just sits in the queue forever)

1 comments

Makes sense to me. I have some code that uses a lot of mullo, so I get to pay twice the latency compared to if I wanted full multiplies...