| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by titzer 1926 days ago

Thanks for the link.

Just staring at the machine code, it looks like the hottest loop for wyrng is about 10 instructions with a store in it. If the processor can do that loop in 1 cycle on average then...holy fuck.

edit: I was looking at similar code generated by clang on my machine. Again, holy fuck.

I don't think the story here is that 64x64=128 multiply is fast, honestly. The real story is the insane level of speculation and huge ROB that is necessary to make that non-unrolled loop go so fast. Everything has to go pretty much perfect to achieve that throughput.

3 comments

pbsd 1926 days ago

Based on the information from [1] we have something like this for both loops:

    .LBB0_2:
            eor     x13, x9, x9, lsr #30     # 2 \* p1-6
            mul     x13, x13, x11            # 1 \* p5-6
            eor     x13, x13, x13, lsr #27   # 2 \* p1-6
            mul     x13, x13, x12            # 1 \* p5-6
            eor     x13, x13, x13, lsr #31   # 2 \* p1-6
            str     x13, [x0, x10, lsl #3]   # 1 \* p7-8
            add     x13, x10, #2             # 1 \* p1-6
            add     x9, x9, x8               # 1 \* p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB0_2                  # Fused into 1 \* p1-3
                                             # Total: 11 uops

    .LBB1_2:
            mul     x13, x9, x11             # 1 \* p5-6
            umulh   x14, x9, x11             # 1 \* p5-6
            eor     x13, x14, x13            # 1 \* p1-6
            mul     x14, x13, x12            # 1 \* p5-6
            umulh   x13, x13, x12            # 1 \* p5-6
            eor     x13, x13, x14            # 1 \* p1-6
            str     x13, [x0, x10, lsl #3]   # 1 \* p7-8
            add     x13, x10, #2             # 1 \* p1-6
            add     x9, x9, x8               # 1 \* p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB1_2                  # Fused into 1 \* p1-3
                                             # Total: 10 uops

Purely based on number of uops, there's a slight win for wyhash, all other things being equal. However, I doubt that you're really getting one iteration per second here; there are 6 integer units, and even if you perfectly exploited instruction parallelism you're limited to 6 ALU instructions per cycle, which are less than the extent of either loop. It would be possible if the mul-umulh pairs are getting fused, which would bring it down to 8 uops per iteration.

Taking into account the port distribution, each iteration of wyhash involves 4 uops being dispatched to ports 5 and 6, which means you should be getting at least 2 cycles/iteration purely for the multiplications. If it's much lower than that, the whole multiplication being fused into a single port 5-6 uop might be right.

However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.

[1] https://dougallj.github.io/applecpu/firestorm.html

link

gpderetta 1926 days ago

I think you are right that mull/h are fused. I think that M1 has 128 ALUs for the vector unit, so it would be a good way to make use of them. M1 is far from the first iteration of the architecture and Apple has likely picked most if not all low hanging fruits. It also helps x86 emulation I guess.

edit: but see the comment else thread about the loop iteration time being off by a factor of 2.

link

pbsd 1925 days ago

Oh yeah, I thought the add r,r,2 was odd but didn't investigate. This brings things back to ~2+ cycles per iteration, which strictly speaking does not require fusion.

It would be easier to test this explicitly instead of inside some unrelated RNG.

link

dwaite 1926 days ago

Perhaps there was a misunderstanding when executives talked to the silicon engineers about "the unbelievable speculation for our first in-house desktop-class CPU"?

link

gpderetta 1926 days ago

I don't think a large rob is a significant contributor here.

The very wide execution is though.

link