Also amd avx2 machines faster than intel one but the current benchmarks probably didnt have hardware on hand, googles highway numbers also intel based, I wanted to test also, i got machine but having laziest days :)
Given latency 3 / throughput 1, the only reasonable implementations are:
A) Three ports, each non-pipelined, taking 3 cycles
B) One port, with three-cycle pipeline (each cycle, one instruction can enter the start of the pipeline, and anything in-progress moves forward one stage)
Given that CRC isn't too hard to pipeline, and (B) requires less physical hardware, it is almost certainly (B).
From the port usage reported at https://uops.info/html-instr/CRC32_R64_R64.html, we can conclude (B) for the Intel microarchitectures.
For AMD it's not entirely obvious, but I agree (B) appears more likely.