by bluescarni 1353 days ago
Go and measure it yourself, if you have one :)

https://github.com/Mysticial/Flops/

You can also get a theoretical computation of the Flops, which matches nicely with the experimental measurement. You have to take into account:

- the clock frequency (~3.9 GHz on multithreaded workloads on my machine)

- the number of cores (16)

- the reciprocal throughput of the FMA instruction (~0.5 cycles per instruction, that is, 2 instructions per clock cycle)

- the number of flops per instruction (2 for the FMA instruction, that is, 1 multiply + 1 add)

- the SIMD vector width (4 for double, 8 for float).

Putting it together:

3.9e9 * 16 * 2 * 2 * 4 = 998.4 GFlops (double)

3.9e9 * 16 * 2 * 2 * 8 = 1996.8 GFlops (single)

The measured values on my machine are a bit different, but close (1070 and 2151 GFlops, respectively).
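
For anyone who wants to plug in their own numbers, the multiplication above can be sketched in a few lines of Python. The function and parameter names are made up for illustration; the values are the ones listed above for my machine.

```python
def peak_gflops(clock_hz, cores, fma_per_cycle, flops_per_fma, simd_lanes):
    """Theoretical peak in GFlops: the product of the five factors listed above."""
    return clock_hz * cores * fma_per_cycle * flops_per_fma * simd_lanes / 1e9

# Values from the list above: ~3.9 GHz, 16 cores, 2 FMAs per cycle,
# 2 flops per FMA, and 4 (double) or 8 (float) SIMD lanes.
double_peak = peak_gflops(3.9e9, 16, 2, 2, 4)
single_peak = peak_gflops(3.9e9, 16, 2, 2, 8)

print(f"{double_peak:.1f} GFlops (double)")
print(f"{single_peak:.1f} GFlops (single)")
```

Swap in your own clock frequency, core count and vector width to get the figure for a different CPU; the FMA throughput has to come from the instruction tables for that microarchitecture.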

References:

https://www.agner.org/optimize/instruction_tables.pdf

https://www.agner.org/forum/viewtopic.php?t=56

https://gadgetversus.com/processor/amd-ryzen-9-5950x-gflops-...


I've tried it on a 10980XE (18-core), which got between 600 GFlops and 1.6 TFlops depending on the instruction, in quad-channel mode. I'll try it later on a 32-core Threadripper. The challenge there, I guess, is to keep all cores busy during training without repeating the same gradient computation (both scheduling and memory issues).
2 TFlops or 5 TFlops does not matter much. The 3090Ti does 160 TFlops, i.e. at least 30x (!) faster.
Those are Tensor flops; the numbers for the Zen CPU are "general-purpose" flops (sometimes called "vector flops" in marketing material).

The vector flops for the 3090Ti are 33 TFlops for single precision, 0.5 TFlops for double precision. So, 16x faster than the 5950x in single precision, 2x slower for double precision. At almost 3x the price and >4x the power consumption.
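
As a rough check of those ratios, here is a small Python sketch. It assumes the theoretical 5950x peaks computed upthread (1996.8 and 998.4 GFlops) as the CPU baseline; the dictionary names are purely illustrative.

```python
# Assumption: CPU figures are the theoretical 5950x peaks computed upthread.
cpu_gflops = {"single": 1996.8, "double": 998.4}
# 3090Ti vector flops as quoted above: 33 TFlops single, 0.5 TFlops double.
gpu_gflops = {"single": 33_000.0, "double": 500.0}

# GPU-to-CPU throughput ratio per precision (>1 means the GPU is faster).
ratios = {p: gpu_gflops[p] / cpu_gflops[p] for p in cpu_gflops}

for precision, ratio in ratios.items():
    print(f"{precision}: GPU/CPU = {ratio:.2f}x")
```

With these inputs the single-precision ratio comes out a bit above 16 and the double-precision ratio at about one half, which is where the "16x faster" and "2x slower" figures come from.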

Of course, if all you care about is AI, then there's no argument - but then we are not really talking about a general-purpose device any more.

The narrative of GPUs being "hundreds of times" faster than CPUs is vastly blown out of proportion for general-purpose computing.

I think you missed that this whole discussion is in the context of deep learning, therefore your comment does not apply. It is 30x slower than the 3090Ti for that purpose.
My initial comment was correcting a factually inaccurate statement regarding CPU performance.

It is you who barged into the thread with unrelated GPU performance numbers, but whatever :)

You are missing the forest for the trees.

Here's the comment I assume you are trying to "correct":

> with full training you are out of luck with CPUs, the gap is much bigger. 64c TR could only get to roughly 1TFlops

"1TFlops" is not the main point of that statement, and it is qualified with "roughly", which I suppose is not too far from the truth in this context. The context is "training ... the gap is much bigger", and in this case "much" is at least 30x even with the updated number.