|
|
|
|
|
by janwas
748 days ago
|
|
For very wide, I'm thinking of Semidynamic's 2048-bit HW, which with LMUL=8 gives 2048 byte vectors, or the NEC vector machines. AFAIK it has not been publicly disclosed why Intel did not get AVX-512 into their e-cores, and I heard surprise and anger over this decision. AMD's version of them (Zen4c) are a proof that it is achievable. I am personally happy with the performance of AMD Genoa e.g. for Gemma.cpp; f32 multipliers are not a bottleneck. > The thing is that on many real-world SIMD workloads, Apple's 4x128bit either matches or outperforms either Intel's or AMD's implementation Perhaps, though on VQSort it was more like 50% the performance. And if so, it's more likely due to the astonishingly anemic memory BW on current x86 servers. Bolting on more cores for ever more imbalanced systems does not sound like progress to me, except for poorly optimized, branch-heavy code. |
|
I looked at the paper and my interpretation is that the performance delta between M1 (Neon) and the Xeon (AVX2) can be fully explained by the difference in clock (3.7 vs 3.3 Ghz) and the difference in L1D bandwidth (48byes/cycle vs. 128bytes/cycle). I don't see any evidence here that narrow SIMD is less efficient.
The AVX-512 is much faster, but that is because it has hardware features (most importantly, compact) that are central to the algorithm. On AVX2 and Neon these are emulated with slower sequences.