Hacker News new | ask | show | jobs
by unnah 615 days ago
A Zen 5 core has four parallel AVX-512 execution units, so it should be able to execute 128 16-bit operations in parallel, or over 24k on 192 cores. However I think the 192-core processors use the compact variant core Zen 5c, and I'm not sure if Zen 5c is quite as capable as the full Zen 5 core.
1 comments

Right, I found this interesting as a thought exercise and took it from another angle.

Since it takes 4 cycles to execute FMA on double-precision 64-bit floats (VFMADD132PD) this translates to 1.25G ops/s (GFLOPS/s) per each core@5GHz. At 192 cores this is 240 GFLOPS/s. For a single FMA unit. At 2x FMA units per core this becomes 480 GFLOPS/s.

For 16-bit operations this becomes 1920 GFLOPS/s or 1.92 TFLOPS/s for FMA workloads.

Similarly, 16-bit FADD workloads are able to sustain more at 2550 GFLOPS/s or 2.55 TFLOPS/s since the FADD is a bit cheaper (3 cycles).

This means that for combined half-precision FADD+FMA workloads zen5 at 192 cores should be able to sustain ~4.5 TFLOPS/s.

Nvidia H100 OTOH per wikipedia entries, if correct, can sustain 50-65 TFLOP/s at single-precision and 750-1000 TFLOPS/s at half-precision. Quite a difference.

The execution units are fully pipelined, so although the latency is four cycles, you can receive one result every cycle from each of the execution units.

For a Zen 5 core, that means 16 double precision FMAs per cycle using AVX 512, so 80gflop per core at 5ghz, or twice that using fp32

You're absolutely right, not sure why I dumbed down my example to a single instruction. Correct way to estimate this number is to feed and keep the whole pipeline busy.

This is actually a bit crazy when you stop and think about it. Nowadays CPUs are packing more and more cores per die at somewhat increasing clock frequencies so they are actually coming quite close to the GPUs.

I mean, top of the line Nvidia H100 can sustain ~30 to ~60 TFLOPS whereas Zen 5 with 192 cores can do only half as much, ~15 to ~30 TFLOPS. This is not even a 10x difference.

I agree! I think people are used to comparing to a single threaded execution of non-vectorized code, which is using .1% of a modern CPU's compute power.

Where the balance slants all the way towards gpus again is the tensor units using reduced precision...