by owlbite 851 days ago
Focussing on vector length is an error. You should care about throughput and number of different instructions you can retire.

If I have a core with four 128-bit NEON vector units, I have the same peak throughput as an x86 core with two AVX2 units or one AVX-512 unit. However, that 4x128-bit core is actually more flexible than the other two, as I can do 4 different things at once, or 4 scalar operations per cycle. (Of course, the downside is that you spend more frontend resources on decode.)
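
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes every unit is fully pipelined and accepts one operation per cycle, and it ignores clock speed, frontend limits and memory bandwidth. The `VectorConfig` class and its method names are made up for illustration; they don't come from any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorConfig:
    """Hypothetical model of a core's vector back end (illustration only)."""
    name: str
    units: int       # number of vector execution units
    width_bits: int  # width of each unit in bits

    def bits_per_cycle(self) -> int:
        # Peak data processed per cycle, assuming one op per unit per cycle.
        return self.units * self.width_bits

    def lanes_per_cycle(self, element_bits: int = 32) -> int:
        # Peak number of elements (e.g. 32-bit floats) processed per cycle.
        return self.bits_per_cycle() // element_bits

    def independent_ops_per_cycle(self) -> int:
        # Each unit can run a different instruction, vector or scalar.
        return self.units


CONFIGS = [
    VectorConfig("4 x 128-bit NEON", units=4, width_bits=128),
    VectorConfig("2 x 256-bit AVX2", units=2, width_bits=256),
    VectorConfig("1 x 512-bit AVX-512", units=1, width_bits=512),
]

for cfg in CONFIGS:
    print(
        f"{cfg.name:20s} "
        f"bits/cycle={cfg.bits_per_cycle()} "
        f"fp32 lanes/cycle={cfg.lanes_per_cycle(32)} "
        f"independent ops/cycle={cfg.independent_ops_per_cycle()}"
    )
```

Under those assumptions all three configurations move 512 bits (16 fp32 lanes) per cycle, but only the 4x128 layout can issue four unrelated operations in the same cycle.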

Given that most code isn't vector code, the multiple-short-vector approach is actually superior on many common real-world workloads that aren't machine learning (and the CPU is the unit of last resort for large ML workloads anyway).