Hacker News
by manwe150 466 days ago
I wonder if you could have a compiler that intentionally pads every instruction to operate on a vector register pair of <value, slow-const>, so that every value gets paired with whatever constant or other expression causes the slowest execution of that vector instruction and the longest latency chain. Otherwise, it does look like the VDIV instructions can have variable latencies depending on the data itself (and not just Spectre-like speculation issues from memory access patterns).

From https://uops.info/table.html

1 comment

You don't have any guarantee that the SIMD operations are actually done in parallel (which is one of the assumptions needed for “the latency matches that of the slowest element”). E.g., I believe VPGATHERDD is microcoded (and serial) on some CPUs, NEON (128-bit) is designed to be efficiently implementable by running a 64-bit ALU twice, AVX2 and AVX-512 are double-pumped on some (many) CPUs likewise…
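
To make the objection concrete, here is a toy sketch (not a measurement) of why the <value, slow-const> padding only hides data-dependent timing if the lanes really run in parallel. Everything in it is an assumption for illustration: the cycle counts, the `lane_latency` function, and `SLOW_CONST` are made up and do not come from uops.info or any real CPU. It also simplifies serial execution to one lane at a time, whereas real double-pumping splits a wide vector into halves.

```python
# Toy latency model. All cycle counts are hypothetical, chosen only to
# illustrate the argument; they are not taken from any real hardware.

def lane_latency(divisor: float) -> int:
    """Assumed data-dependent latency (in cycles) of one division lane."""
    return 4 if divisor == 1.0 else 13  # assumed: a trivial divisor finishes early


SLOW_CONST = 3.0  # assumed operand that hits the slowest path in this model


def parallel_latency(lanes) -> int:
    # All lanes execute at once, so the instruction waits for the slowest lane.
    return max(lane_latency(x) for x in lanes)


def serial_latency(lanes) -> int:
    # Lanes execute one after another (e.g. a microcoded or narrower unit),
    # so the per-lane latencies add up.
    return sum(lane_latency(x) for x in lanes)


if __name__ == "__main__":
    for value in (1.0, 7.0):
        pair = (value, SLOW_CONST)  # the <value, slow-const> padding
        print(value, parallel_latency(pair), serial_latency(pair))
```

In this model the parallel latency is 13 cycles for both inputs, so the padding hides the difference, while the serial latency is 17 versus 26 cycles, so the secret-dependent timing still leaks through.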