by titzer
1926 days ago
Thanks for the link. Just staring at the machine code, it looks like the hottest loop for wyrng is about 10 instructions with a store in it. If the processor can do that loop in 1 cycle on average then...holy fuck. Edit: I was looking at similar code generated by clang on my machine. Again, holy fuck. I don't think the story here is that the 64x64=128 multiply is fast, honestly. The real story is the insane level of speculation and the huge reorder buffer (ROB) necessary to make that non-unrolled loop go so fast. Everything has to go pretty much perfectly to achieve that throughput.
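For anyone who hasn't looked at the loop, here is a rough Python sketch of what a wyrand-style step computes: bump the state by a constant, do one 64x64=128 multiply, and XOR the two halves together. The constants are as I recall them from the wyhash source, so treat them as assumptions; the names `mum`, `wyrand_step` and `fill` are made up for illustration. It says nothing about speed, it only shows how little work each iteration contains.

```python
MASK64 = (1 << 64) - 1

# Constants as recalled from the wyhash source; treat them as assumptions.
WYP0 = 0xA0761D6478BD642F
WYP1 = 0xE7037ED1A0B428DB


def mum(a, b):
    """64x64 -> 128-bit multiply, then fold the high and low halves with XOR."""
    product = (a & MASK64) * (b & MASK64)
    return (product >> 64) ^ (product & MASK64)


def wyrand_step(state):
    """Advance the state once and return (new_state, output)."""
    state = (state + WYP0) & MASK64
    return state, mum(state, state ^ WYP1)


def fill(n, seed=0):
    """Generate n values; the append plays the role of the store in the hot loop."""
    out = []
    state = seed
    for _ in range(n):
        state, value = wyrand_step(state)
        out.append(value)
    return out
```

Each output depends on the previous iteration only through the add on the state, which is why a wide out-of-order core can overlap many multiplies at once.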
|
Taking into account the port distribution, each iteration of wyhash dispatches 4 uops to ports 5 and 6. With two ports available, that puts a floor of 2 cycles/iteration on the multiplications alone. If the measured figure is much lower than that, the idea that the whole multiplication is fused into a single uop on ports 5-6 might be right.
However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.
[1] https://dougallj.github.io/applecpu/firestorm.html
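The bound above is just uops divided by ports. A small sketch of that arithmetic, assuming each port accepts one multiply uop per cycle (`min_cycles_per_iteration` is a hypothetical helper, and the fused case with 2 uops is only an illustration of the hypothesis, not a measured figure):

```python
def min_cycles_per_iteration(uops, ports):
    """Lower bound on cycles per iteration when `uops` micro-ops must share
    `ports` execution ports, assuming each port accepts one uop per cycle."""
    if ports <= 0:
        raise ValueError("need at least one port")
    return uops / ports


# 4 multiply uops spread over ports 5 and 6, as described above.
unfused = min_cycles_per_iteration(4, 2)

# Hypothetical fused case: assume each 128-bit multiply is a single uop,
# which would halve the uop count.
fused = min_cycles_per_iteration(2, 2)
```

Here `unfused` comes out at 2 cycles/iteration and `fused` at 1, so a measured figure well below 2 would point towards fusion.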