|
The blog post links to the benchmark...
It's repeatedly populating a 20k-entry array with the results. godbolt clang compiles it to:
  .LBB5_2: // =>This Inner Loop Header: Depth=1
    mul x13, x11, x10
    umulh x14, x11, x10
    eor x13, x14, x13
    mul x14, x13, x12
    umulh x13, x13, x12
    eor x13, x13, x14
    str x13, [x0, x8, lsl #3]
    add x8, x8, #2 // =2
    cmp x8, x1
    add x11, x11, x9
    b.lo .LBB5_2
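For reference, here is a rough C sketch of what that loop appears to compute: a state word bumped by a constant each iteration, then two full 64x64 to 128-bit multiplies whose high and low halves get XORed together (the mul/umulh/eor triples). The function names are made up for illustration, and the three constants are the ones I recall from the wyhash64 code rather than anything visible in the listing, so treat them as assumptions. `__uint128_t` is a GCC/clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two 64-bit values into a 128-bit product and fold it:
   high half XOR low half. On AArch64 this is mul + umulh + eor. */
static inline uint64_t mul_fold(uint64_t a, uint64_t b) {
    __uint128_t p = (__uint128_t)a * b;
    return (uint64_t)(p >> 64) ^ (uint64_t)p;
}

/* One generator step. The constants are assumed, not read from the listing. */
static inline uint64_t wyhash64_step(uint64_t *state) {
    *state += UINT64_C(0x60bee2bee120fc15);
    uint64_t m1 = mul_fold(*state, UINT64_C(0xa3b195354a39b70d));
    return mul_fold(m1, UINT64_C(0x1b03738712fad5c9));
}

/* Fill an array with consecutive outputs, as the benchmark loop does. */
void populate(uint64_t *out, size_t n, uint64_t seed) {
    for (size_t i = 0; i < n; i++) {
        out[i] = wyhash64_step(&seed);
    }
}
```

Calling `populate(buf, 20000, seed)` in a timing loop would be the rough equivalent of the benchmark described above.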
[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/... |
Just staring at the machine code, it looks like the hottest loop for wyrng is about 10 instructions with a store in it. If the processor can do that loop in 1 cycle on average then...holy fuck.
edit: I was looking at similar code generated by clang on my machine. Again, holy fuck.
I don't think the story here is that 64x64=128 multiply is fast, honestly. The real story is the insane level of speculation and the huge ROB needed to make that non-unrolled loop go so fast. Pretty much everything has to go perfectly to achieve that throughput.
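To make the dependency argument concrete: in that loop the only values carried from one iteration to the next are the state add and the index increment. The multiply/XOR chain of one iteration never feeds the next one, so an out-of-order core is free to have many iterations in flight at once. A small C sketch of that property follows; the names `mix`, `kth_sequential` and `kth_direct` are hypothetical and the constants are assumed, as before.

```c
#include <stdint.h>

#define INC UINT64_C(0x60bee2bee120fc15) /* assumed state increment */

/* Two multiply-and-fold rounds on a state value (assumed constants). */
static inline uint64_t mix(uint64_t s) {
    __uint128_t p = (__uint128_t)s * UINT64_C(0xa3b195354a39b70d);
    uint64_t m1 = (uint64_t)(p >> 64) ^ (uint64_t)p;
    p = (__uint128_t)m1 * UINT64_C(0x1b03738712fad5c9);
    return (uint64_t)(p >> 64) ^ (uint64_t)p;
}

/* k-th output (k >= 1) obtained by stepping the state k times. */
uint64_t kth_sequential(uint64_t seed, uint64_t k) {
    uint64_t out = 0;
    for (uint64_t i = 0; i < k; i++) {
        seed += INC;      /* the only loop-carried data dependency */
        out = mix(seed);  /* depends on this iteration's state only */
    }
    return out;
}

/* Same k-th output computed directly: no earlier output is needed. */
uint64_t kth_direct(uint64_t seed, uint64_t k) {
    return mix(seed + k * INC); /* unsigned arithmetic wraps mod 2^64 */
}
```

Both functions return the same value for any `k >= 1`, which is just another way of saying the long-latency multiplies of different iterations are independent of each other.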