by RaisingSpear
602 days ago
I suspect Intel uses 32x32b multipliers rather than the 16x16b ones he theorised, but with only one multiplier for every second 32-bit lane.
That lines up more closely with VPMULLQ, and it would otherwise seem odd for PMULUDQ to be one uOp versus PMULLD's two. PMULLD is probably just doing 2x PMULUDQ and discarding the high bits. (I tried commenting on his blog, but the comment is awaiting moderation. I don't know whether that queue is ever checked or whether comments just sit in it forever.)
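
To make the "2x PMULUDQ and discard the high bits" idea concrete, here is a rough Python model of the lane arithmetic. It only illustrates the guess above and says nothing certain about how the hardware is actually built; the helper names (`pmuludq`, `pmulld_via_pmuludq`, `pmulld_reference`) are made up for this sketch, and a 128-bit register is modelled as a list of four unsigned 32-bit lanes.

```python
MASK32 = 0xFFFFFFFF

def pmuludq(a, b):
    """Model of PMULUDQ: multiply the even 32-bit lanes (0 and 2) as
    unsigned values and return the two full 64-bit products."""
    return [(a[0] & MASK32) * (b[0] & MASK32),
            (a[2] & MASK32) * (b[2] & MASK32)]

def pmulld_via_pmuludq(a, b):
    """Hypothetical PMULLD built from two PMULUDQ passes (an assumption,
    not a confirmed description of Intel's implementation)."""
    # Pass 1: even lanes go straight through the 32x32 multipliers.
    even = pmuludq(a, b)
    # Pass 2: move the odd lanes into the even positions first
    # (what a 32-bit right shift of each 64-bit element would do).
    a_odd = [a[1], 0, a[3], 0]
    b_odd = [b[1], 0, b[3], 0]
    odd = pmuludq(a_odd, b_odd)
    # Discard the high 32 bits of each product and interleave the results.
    return [even[0] & MASK32, odd[0] & MASK32,
            even[1] & MASK32, odd[1] & MASK32]

def pmulld_reference(a, b):
    """Direct model of PMULLD: low 32 bits of each lane-wise product."""
    return [((x & MASK32) * (y & MASK32)) & MASK32 for x, y in zip(a, b)]

a = [3, 0xFFFFFFFF, 0x12345678, 0x80000000]
b = [5, 2, 0x9ABCDEF0, 3]
result = pmulld_via_pmuludq(a, b)
matches = result == pmulld_reference(a, b)
```

The low 32 bits of a product are the same whether the inputs are treated as signed or unsigned, which is why an unsigned widening multiply would be enough for this.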