by ribit
749 days ago
> Perhaps, though on VQSort it was more like 50% the performance. I looked at the paper and my interpretation is that the performance delta between M1 (Neon) and the Xeon (AVX2) can be fully explained by the difference in clock (3.7 vs 3.3 GHz) and the difference in L1D bandwidth (48 bytes/cycle vs. 128 bytes/cycle). I don't see any evidence here that narrow SIMD is less efficient. The AVX-512 is much faster, but that is because it has hardware features (most importantly, compact) that are central to the algorithm. On AVX2 and Neon these are emulated with slower sequences.

Isn't the L1D bandwidth tied to the SIMD width? That is, wouldn't it be unachievable on Skylake if the code also used only 128-bit vectors there?
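
To make the question concrete, here is a back-of-the-envelope sketch in Python. It assumes peak L1D load bandwidth is simply (vector loads issued per cycle) × (bytes per vector), and it assumes two vector loads per cycle; that number is my assumption for illustration, not something taken from the paper, and the function name is made up.

```python
# Rough model: peak L1D load bandwidth scales with the vector width used.
# ASSUMED_LOADS_PER_CYCLE is an illustrative assumption, not a measured value.
ASSUMED_LOADS_PER_CYCLE = 2


def peak_l1d_load_bytes_per_cycle(loads_per_cycle: int, vector_bits: int) -> int:
    """Bytes per cycle if every load slot is filled with a full-width vector load."""
    return loads_per_cycle * (vector_bits // 8)


if __name__ == "__main__":
    for bits in (128, 256, 512):
        bw = peak_l1d_load_bytes_per_cycle(ASSUMED_LOADS_PER_CYCLE, bits)
        print(f"{bits}-bit vectors: {bw} bytes/cycle")
```

Under those assumptions, the 128 bytes/cycle figure only shows up with 512-bit loads, and 128-bit loads would top out at a quarter of that, which is the point of the question.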