|
|
|
|
|
by menaerus
589 days ago
|
|
> Theoretically, when doing FP32 FMA, each CPU core can do 32 FLOP/cycle. At the base frequency, this translates into 134 GFlops per core. Isn't it that zen4 doesn't have "native" support for AVX-512 but "mimics" it through 2x 256-bit FMA units? Because of this, a single AVX-512 instruction will occupy both FMA units and therefore I think that the theoretical limit for a single zen4 core should be half of the 134 GFLOPS number? |
|
According to uops.info, Zen 4 cores can do two 8-wide FMA instructions per cycle, or one 16-wide FMA per cycle. See VFMADD132PS (YMM, YMM, YMM) and VFMADD132PS (ZMM, ZMM, ZMM) respectively, the throughput column is labelled TP. That’s where 32 FLOP/cycle number comes from.
> doesn't have "native" support for AVX-512 but "mimics" it through 2x 256-bit FMA units
That’s correct, AVX512 doesn’t deliver more FLOPs on that CPU. The throughput of 32-byte FMA and 64-byte FMA is the same, 32 FLOP/cycle for FP32 numbers.