Hacker News new | ask | show | jobs
by titzer 1926 days ago
Did anyone actually look at the machine code generated here? 0.30ns per value? That is basically 1 cycle. Of course, there is no way that a processor can compute so many dependent instructions in one cycle, simply because they generate so many dependent micro-ops, and every micro-op is at least one cycle to go through an execution unit. So this must mean that either the compiler is unrolling the (benchmarking) loop, or the processor is speculating many loop iterations into the future, so that the latencies can be overlapped and it works out to 1 cycle on average. 1 cycle on average for any kind of loop is just flat out suspicious.

This requires a lot more digging to understand.

Simply put, I don't accept the hastily arrived-at conclusion, and wish Daniel would put more effort into investigation in the future. This experiment is a poor example of how to investigate performance on small kernels. You should be looking at the assembly code output by the compiler at this point instead of spitballing.

6 comments

This is the benchmarking loop:

  for (size_t i = 0; i < N; i++) {
    out[i++] = g();
  }
N is 20000 and the time measured is divided by N. [1] However, that loop has two increments and only computes 10000 numbers.

This is also visible in the assembly

  add     x8, x8, #2
So if I see this correctly the results are off by a factor of 2.

[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

Yes, the i++ seems an oversight.

The relative speed between the two hashes is still the same, but it is no longer one iteration per cycle.

> Update: The numbers were updated since they were off by a factor of two due to a typographical error in the code.

The article got updated by now :)

A C for statement is a "benchmarking loop" in the same sense that "slice a sponge cake into two layers and place custard in the middle" is an actionable dessert recipe.
Failing to post disassembly for a micro benchmark is annoying.

It is of course speculating all the way through the loop; a short backwards conditional branch will be speculated as "taken" by even very simple predictors.

Op fusion is very likely, as is register renaming: I suspect that "mul" always computes both products, and the upper one is left in a register which isn't visible to the programmer until they use "mulh" with the same argument. At which point it's just renamed into the target register.

The dependency chain is state += 0x60bee2bee120fc15ull or (state += UINT64_C(0x9E3779B97F4A7C15)); the rest of the calculations are independent per iteration.

Anyway, the more important fact is that 64x64b -> 128b mul might be one instruction on x86, but it's broken into 2 µops. Because modern CPUs generally don't design around µops being able to write two registers in the same set.

It's a shame we can't see the rest of the code. What is happening to the result value? Is it being compared to something? Put into an array, or what? All of that code probably totally outweighs what you pointed out here. Or, at least it should. I have a bad feeling it might be being dead-code eliminated, since compilers are super aggressive about that nowadays, but I hope he's somehow controlled for that.
The blog post links to the benchmark... It's repeatedly populating a 20k entry array with the results.

godbolt clang compiles it to:

    .LBB5_2:                                // =>This Inner Loop Header: Depth=1
        mul     x13, x11, x10
        umulh   x14, x11, x10
        eor     x13, x14, x13
        mul     x14, x13, x12
        umulh   x13, x13, x12
        eor     x13, x13, x14
        str     x13, [x0, x8, lsl #3]
        add     x8, x8, #2                      // =2
        cmp     x8, x1
        add     x11, x11, x9
        b.lo    .LBB5_2
[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
Thanks for the link.

Just staring at the machine code, it looks like the hottest loop for wyrng is about 10 instructions with a store in it. If the processor can do that loop in 1 cycle on average then...holy fuck.

edit: I was looking at similar code generated by clang on my machine. Again, holy fuck.

I don't think the story here is that 64x64=128 multiply is fast, honestly. The real story is the insane level of speculation and huge ROB that is necessary to make that non-unrolled loop go so fast. Everything has to go pretty much perfect to achieve that throughput.

Based on the information from [1] we have something like this for both loops:

    .LBB0_2:
            eor     x13, x9, x9, lsr #30     # 2 \* p1-6
            mul     x13, x13, x11            # 1 \* p5-6
            eor     x13, x13, x13, lsr #27   # 2 \* p1-6
            mul     x13, x13, x12            # 1 \* p5-6
            eor     x13, x13, x13, lsr #31   # 2 \* p1-6
            str     x13, [x0, x10, lsl #3]   # 1 \* p7-8
            add     x13, x10, #2             # 1 \* p1-6
            add     x9, x9, x8               # 1 \* p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB0_2                  # Fused into 1 \* p1-3
                                             # Total: 11 uops

    .LBB1_2:
            mul     x13, x9, x11             # 1 \* p5-6
            umulh   x14, x9, x11             # 1 \* p5-6
            eor     x13, x14, x13            # 1 \* p1-6
            mul     x14, x13, x12            # 1 \* p5-6
            umulh   x13, x13, x12            # 1 \* p5-6
            eor     x13, x13, x14            # 1 \* p1-6
            str     x13, [x0, x10, lsl #3]   # 1 \* p7-8
            add     x13, x10, #2             # 1 \* p1-6
            add     x9, x9, x8               # 1 \* p1-6
            mov     x10, x13                 # none
            cmp     x13, x1                  #
            b.lo    .LBB1_2                  # Fused into 1 \* p1-3
                                             # Total: 10 uops
Purely based on number of uops, there's a slight win for wyhash, all other things being equal. However, I doubt that you're really getting one iteration per second here; there are 6 integer units, and even if you perfectly exploited instruction parallelism you're limited to 6 ALU instructions per cycle, which are less than the extent of either loop. It would be possible if the mul-umulh pairs are getting fused, which would bring it down to 8 uops per iteration.

Taking into account the port distribution, each iteration of wyhash involves 4 uops being dispatched to ports 5 and 6, which means you should be getting at least 2 cycles/iteration purely for the multiplications. If it's much lower than that, the whole multiplication being fused into a single port 5-6 uop might be right.

However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.

[1] https://dougallj.github.io/applecpu/firestorm.html

I think you are right that mull/h are fused. I think that M1 has 128 ALUs for the vector unit, so it would be a good way to make use of them. M1 is far from the first iteration of the architecture and Apple has likely picked most if not all low hanging fruits. It also helps x86 emulation I guess.

edit: but see the comment else thread about the loop iteration time being off by a factor of 2.

Perhaps there was a misunderstanding when executives talked to the silicon engineers about "the unbelievable speculation for our first in-house desktop-class CPU"?
I don't think a large rob is a significant contributor here.

The very wide execution is though.

I would be wary of using gettimeofday to measure such short periods. As https://pubs.opengroup.org/onlinepubs/009604599/functions/ge... says, the resolution of the system clock is unspecified, and 20,000 × ≈10 ≈ 200k instructions easily run in under a millisecond on modern hardware.

The benchmark probably get rids of that by doing it 40,000 times in quick succession, but why not measure the time of all 40,000 iterations in one go, and decrease the risk (and the overhead of calling gettimeofday 39,999 times)?

The resolution of the clock is unspecified by the POSIX standard but that does not mean it's unspecified by the platform this is actually running on. If this was trying to be portable code you'd have a big issue there but it's not. It is still limited by gettimeofday maxing out at microsecond precision, though, which is quite poor. And seems to be using a realtime clock, which introduces errors from network time sync & such. That's unlikely to crop up here, but it's still a risk. clock_gettime_nsec_np(CLOCK_MONOTONIC) is what should be used here.
Agree, you probably want to just read some cpu time stamp counter before and after, and make sure you ar pinning your threwads, locking the clocks, etc. so that you can get a reliable time from there, or... just use cycles as your unit of measure.
On Linux and MacOS, gettimeofday is accurate to microseconds. Not only is the returned struct is expressed in microseconds, but I have personally observed that it is accurate to microseconds.
The conclusion seems based on the relative execution times for the two benchmarks. Since the benchmarks are measured in the same way, their error bars should be basically the same as well. This analysis is not an analysis of the absolute execution time of these algorithms, but the difference between them.

I don't think the conclusion is hasty. Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".

> Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".

But that doesn't follow either. Only by inspecting the machine code do we get to see what's really going on in a loop, and the ultimate result is dependent on a lot of factors: if the compiler unrolled the loop (here: no), whether there were any spills in the loop (here: no), what the length of the longest dependency chain in the loop is, how many micro-ops for the loop, how many execution ports there are in the processor, and what type, the frontend decode bandwidth (M1: seems up to 5 ins/cycle), whether there is a loop stream buffer (M1: seems no, but most intel processors, yes), the latency of L1 cache, how many loads/stores can be in-flight, etc, etc. These are the things you gotta look at to know the real answer.

At that throughput the CPU is speculating and exploiting the access pattern.

It's also worth saying that if Apple were dead set on throughput in this area they could've implemented some non-trivial fusion to improve performance. I don't have an M1 so I can't find out for you (and Apple are steadfast on not documenting anything about the microarchitecture...)

Totally agree. I was thinking he'd get there, and then the post abruptly ended.