| HN Mirror

Indeed, Zen 4 splits uops just as it passes them to the pipes, but Zen 4 is already doing that; adding more ports doesn't mean you can do it twice (without, like, making those ports 128-bit (thus not gaining any throughput), or making a new AVX-1024).

Allowing access to separate halves of the 512-bit pipes makes sense, but that then still needs separate ports for each half, otherwise there's nothing to schedule the other half to. uops.info data[0] shows that 256-bit shuffle throughput is indeed double that of 512-bit, but seemingly both still increment either the FP1 or FP2 port (these overlap the regular four ALU port numbers!), so the AVX2 shuffles still have two ports to target.

So Zen 4's (perf-counter-indicated) ports are rather unrelated to the available execution units (not in any way a new concept, but still interesting). Which would seem to indicate that perhaps something like "vaddps zmm; vpermd zmm" can manage 1/cycle, while "vaddps ymm; vaddps ymm; vpermd ymm; vpermd ymm" would fight for FP2 (for reference, vaddps uses either FP2 or FP3)? Fun.
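
To put a rough number on the "fight for FP2" case, here's a toy port-pressure bound in Python. It assumes the perf-counter ports really are issue ports that each take one uop per cycle, which is exactly the assumption questioned above, so treat it as arithmetic rather than a model of Zen 4. The function name `min_cycles` and the port sets are made up for illustration; only the FP1/FP2 and FP2/FP3 assignments come from the text.

```python
from fractions import Fraction
from itertools import combinations

def min_cycles(uops):
    """Lower bound on cycles per iteration for a group of uops.

    Each uop is a set of ports it may issue to. Assumes (hypothetically)
    one uop per port per cycle. For every subset of ports, the uops that
    can ONLY go to that subset need at least count/len(subset) cycles.
    """
    ports = sorted(set().union(*uops))
    best = Fraction(0)
    for k in range(1, len(ports) + 1):
        for subset in combinations(ports, k):
            allowed_here = set(subset)
            confined = sum(1 for allowed in uops if set(allowed) <= allowed_here)
            best = max(best, Fraction(confined, k))
    return best

# Port assignments as described in the text (counter-indicated, not verified units).
VPERMD_YMM = {"FP1", "FP2"}
VADDPS_YMM = {"FP2", "FP3"}

ymm_group = [VADDPS_YMM, VADDPS_YMM, VPERMD_YMM, VPERMD_YMM]
print(min_cycles(ymm_group))                 # 4/3
print(min_cycles([VPERMD_YMM, VPERMD_YMM]))  # 1
```

Under that assumption the four ymm ops need at least 4/3 cycles per group, since four uops share three ports and FP2 is the only one both instruction types can use, while two ymm shuffles on their own fit in one cycle.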

[0]: https://uops.info/table.html?search=%22vpermd%20%22&cb_lat=o...