Hacker News new | ask | show | jobs
by corsix 1421 days ago
Given latency 3 / throughput 1, the only reasonable implementations are: A) Three ports, each non-pipelined, taking 3 cycles B) One port, with three-cycle pipeline (each cycle, one instruction can enter the start of the pipeline, and anything in-progress moves forward one stage) Given that CRC isn't too hard to pipeline, and (B) requires less physical hardware, it is almost certainly (B).
1 comments

From the port usage reported at https://uops.info/html-instr/CRC32_R64_R64.html, we can conclude (B) for the Intel microarchitectures. For AMD it's not entirely obvious, but I agree (B) appears more likely.