by echlebek
1926 days ago
The conclusion seems to be based on the relative execution times of the two benchmarks. Since the benchmarks are measured in the same way, their error bars should be basically the same as well. The analysis is not about the absolute execution time of these algorithms, but about the difference between them. I don't think the conclusion is hasty. Lemire is saying: "look, if the M1 full multiplication were slow, we'd expect wyrng to be worse than splitmix, but it isn't".
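For readers who haven't seen the two generators, here is a rough sketch of what is being compared. It is not the benchmark's actual source: the function names are made up, the constants are the ones I recall from commonly published splitmix64 and wyhash-style generators, and `__uint128_t` is a GCC/Clang extension. The point is only that the wyhash-style step needs the full 64x64 -> 128-bit product (both halves), while splitmix only needs the low 64 bits of each multiplication.

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit multiply, then fold the high half into the low half.
   Assumes a compiler that provides __uint128_t (GCC/Clang extension). */
uint64_t mul_fold(uint64_t a, uint64_t b) {
    __uint128_t product = (__uint128_t)a * b;
    return (uint64_t)(product >> 64) ^ (uint64_t)product;
}

/* wyhash-style generator (hypothetical name, constants assumed):
   two full multiplications per output. */
uint64_t wyrng_sketch(uint64_t *state) {
    *state += UINT64_C(0x60bee2bee120fc15);
    uint64_t m1 = mul_fold(*state, UINT64_C(0xa3b195354a39b70d));
    return mul_fold(m1, UINT64_C(0x1b03738712fad5c9));
}

/* splitmix64-style generator (hypothetical name):
   two ordinary multiplications that keep only the low 64 bits. */
uint64_t splitmix_sketch(uint64_t *state) {
    uint64_t z = (*state += UINT64_C(0x9E3779B97F4A7C15));
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}
```

If the high half of the product were expensive to obtain on a given core, one would expect it to show up in `wyrng_sketch` and not in `splitmix_sketch`, which is the comparison the argument relies on.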
But that doesn't follow either. Only by inspecting the machine code do we get to see what's really going on in a loop, and the ultimate result depends on a lot of factors: whether the compiler unrolled the loop (here: no), whether there were any spills in the loop (here: no), the length of the longest dependency chain in the loop, how many micro-ops the loop contains, how many execution ports the processor has and of what type, the frontend decode bandwidth (M1: seems up to 5 ins/cycle), whether there is a loop stream buffer (M1: seems no, but most Intel processors: yes), the latency of the L1 cache, how many loads/stores can be in flight, etc. These are the things you gotta look at to know the real answer.
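To make the dependency-chain point concrete, here is a small sketch with made-up function names and a mixer borrowed from splitmix-style code. Both loops do one multiplication per iteration, but in the first every multiplication has to wait for the previous one, while in the second the only loop-carried work is an add and an xor, so an out-of-order core can in principle overlap the multiplications. How much that matters in practice depends on the compiler output and the core, which is exactly why you have to read the machine code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixing step: one xor-shift, one multiply, one more xor-shift. */
static inline uint64_t mix(uint64_t z) {
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    return z ^ (z >> 27);
}

/* Each iteration consumes the previous iteration's output, so the
   multiplications form one long serial chain. */
uint64_t chained(uint64_t x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        x = mix(x);
    }
    return x;
}

/* The loop-carried state is only the counter add and the xor accumulator;
   the multiplication inside mix() does not feed the next iteration's input. */
uint64_t independent(uint64_t x, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        x += UINT64_C(0x9E3779B97F4A7C15);
        acc ^= mix(x);
    }
    return acc;
}
```

A generator whose state update is a plain add looks like the second loop, so a benchmark of it mostly exercises multiplier throughput rather than multiplier latency.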