Hacker News new | ask | show | jobs
by pbsd 1926 days ago
Based on the information from [1] we have something like this for both loops:

    .LBB0_2:
            eor     x13, x9, x9, lsr #30     # 2 \* p1-6
            mul     x13, x13, x11            # 1 \* p5-6
            eor     x13, x13, x13, lsr #27   # 2 \* p1-6
            mul     x13, x13, x12            # 1 \* p5-6
            eor     x13, x13, x13, lsr #31   # 2 \* p1-6
            str     x13, [x0, x10, lsl #3]   # 1 \* p7-8
            add     x13, x10, #2             # 1 \* p1-6
            add     x9, x9, x8               # 1 \* p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB0_2                  # Fused into 1 \* p1-3
                                             # Total: 11 uops

    .LBB1_2:
            mul     x13, x9, x11             # 1 \* p5-6
            umulh   x14, x9, x11             # 1 \* p5-6
            eor     x13, x14, x13            # 1 \* p1-6
            mul     x14, x13, x12            # 1 \* p5-6
            umulh   x13, x13, x12            # 1 \* p5-6
            eor     x13, x13, x14            # 1 \* p1-6
            str     x13, [x0, x10, lsl #3]   # 1 \* p7-8
            add     x13, x10, #2             # 1 \* p1-6
            add     x9, x9, x8               # 1 \* p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB1_2                  # Fused into 1 \* p1-3
                                             # Total: 10 uops
Purely based on number of uops, there's a slight win for wyhash, all other things being equal. However, I doubt that you're really getting one iteration per second here; there are 6 integer units, and even if you perfectly exploited instruction parallelism you're limited to 6 ALU instructions per cycle, which are less than the extent of either loop. It would be possible if the mul-umulh pairs are getting fused, which would bring it down to 8 uops per iteration.

Taking into account the port distribution, each iteration of wyhash involves 4 uops being dispatched to ports 5 and 6, which means you should be getting at least 2 cycles/iteration purely for the multiplications. If it's much lower than that, the whole multiplication being fused into a single port 5-6 uop might be right.

However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.

[1] https://dougallj.github.io/applecpu/firestorm.html

1 comments

I think you are right that mull/h are fused. I think that M1 has 128 ALUs for the vector unit, so it would be a good way to make use of them. M1 is far from the first iteration of the architecture and Apple has likely picked most if not all low hanging fruits. It also helps x86 emulation I guess.

edit: but see the comment else thread about the loop iteration time being off by a factor of 2.

Oh yeah, I thought the add r,r,2 was odd but didn't investigate. This brings things back to ~2+ cycles per iteration, which strictly speaking does not require fusion.

It would be easier to test this explicitly instead of inside some unrelated RNG.