by camel-cdr
739 days ago
You are probably right about the bypass network, but I don't see why ROB or decode would need to increase. Aren't avx512 instructions only "split" when already at a pipe in zen4?
Also, my understanding was that the CPU can schedule AVX2 instructions to the upper and lower halves of the 512-bit-wide pipes.
|
Allowing access to separate halves of the 512-bit pipes makes sense, but that still needs separate ports for each half; otherwise there's nothing to schedule the other half to. uops.info data[0] shows that 256-bit shuffle throughput is indeed double that of 512-bit, but seemingly both still increment either the FP1 or FP2 port (these overlap the regular four ALU port numbers!), so the AVX2 shuffles still have two ports to target.
So Zen 4's (perf-counter-indicated) ports are largely unrelated to the available execution units (not in any way a new concept, but still interesting). That would seem to indicate that something like "vaddps zmm; vpermd zmm" can perhaps manage 1/cycle, while "vaddps ymm; vaddps ymm; vpermd ymm; vpermd ymm" would fight for FP2 (for reference, vaddps uses either FP2 or FP3)? Fun.
[0]: https://uops.info/table.html?search=%22vpermd%20%22&cb_lat=o...
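
To make the port-counting argument concrete, here is a toy sketch of the naive model, not a measurement. It assumes every instruction is a single uop that occupies one port for one cycle, and it takes the port sets straight from the comment above (vaddps on FP2/FP3, vpermd on FP1/FP2). The function name `port_bound` is made up for illustration. Since the whole point above is that the counted ports may not correspond to real execution units, treat the result only as what simple port accounting would predict.

```python
from itertools import combinations

def port_bound(uops):
    """Lower bound on cycles per iteration from port pressure alone.

    uops: list of sets, each set holding the ports one uop may issue to.
    For every subset S of ports, the uops that can only go to ports in S
    need at least count/|S| cycles; the bound is the maximum over all S.
    """
    ports = sorted(set().union(*uops))
    best = 0.0
    for k in range(1, len(ports) + 1):
        for subset in combinations(ports, k):
            s = set(subset)
            confined = sum(1 for u in uops if u <= s)
            best = max(best, confined / k)
    return best

# Port sets as described in the comment (assumed, not measured here).
VADDPS = {"FP2", "FP3"}
VPERMD = {"FP1", "FP2"}

zmm_pair = [VADDPS, VPERMD]                  # vaddps zmm; vpermd zmm
ymm_quad = [VADDPS, VADDPS, VPERMD, VPERMD]  # the four-instruction ymm sequence

print("zmm pair bound:", port_bound(zmm_pair))
print("ymm quad bound:", port_bound(ymm_quad))
```

Under these assumptions the zmm pair comes out below one cycle per iteration, while the four ymm instructions come out above one cycle because four uops have to share three ports with FP2 common to both kinds. Whether real hardware behaves like this is exactly the open question.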