Hacker News new | ask | show | jobs
by 323 1419 days ago
But the question was how many execution ports it has, not what's the latency/throughput of one port.
1 comments

Given latency 3 / throughput 1, the only reasonable implementations are: A) Three ports, each non-pipelined, taking 3 cycles B) One port, with three-cycle pipeline (each cycle, one instruction can enter the start of the pipeline, and anything in-progress moves forward one stage) Given that CRC isn't too hard to pipeline, and (B) requires less physical hardware, it is almost certainly (B).
From the port usage reported at https://uops.info/html-instr/CRC32_R64_R64.html, we can conclude (B) for the Intel microarchitectures. For AMD it's not entirely obvious, but I agree (B) appears more likely.