by Sesse__ 466 days ago
You don't have any guarantee that the SIMD operations are actually done in parallel (which is one of the assumptions needed for “the latency matches that of the slowest element”). E.g., I believe VPGATHERDD is microcoded (and serial) on some CPUs, NEON (128-bit) is designed to be efficiently implementable by running a 64-bit ALU twice, and AVX2 and AVX512 are likewise double-pumped on some (many) CPUs…
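To make the difference concrete, here is a toy model, not a measurement of any real CPU. The function names and the per-element latencies are made up for illustration, and it deliberately ignores pipelining and overlap between passes, so treat the numbers as a rough upper bound on the effect rather than what hardware actually does.

```python
# Toy latency model (all names and numbers are hypothetical).
# Each entry is the latency, in cycles, of one SIMD lane's operation.

def parallel_latency(lanes):
    """All lanes execute at once: latency is that of the slowest lane."""
    return max(lanes)

def serial_latency(lanes):
    """Lanes handled one after another (e.g. a microcoded gather)."""
    return sum(lanes)

def pumped_latency(lanes, passes):
    """Vector split into `passes` chunks run back to back on a narrower unit.

    Assumes no overlap between passes, which real pipelines usually have.
    """
    chunk = -(-len(lanes) // passes)  # ceiling division
    return sum(max(lanes[i:i + chunk]) for i in range(0, len(lanes), chunk))

# Eight lanes, one of which is slow (say, a cache miss in a gather).
lat = [4, 4, 4, 200, 4, 4, 4, 4]

print("fully parallel:", parallel_latency(lat))
print("fully serial:  ", serial_latency(lat))
print("double-pumped: ", pumped_latency(lat, passes=2))
```

Only the first case matches the “slowest element” assumption; in the other two, the fast lanes add to the total as well.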