I've tried it on a 10980XE (18 cores) in quad-channel mode, which got between 600 GFlops and 1.6 TFlops depending on the instruction. I'll try later on a 32-core Threadripper. The challenge there, I guess, is keeping all cores busy during training without repeating the same gradient computation (both a scheduling and a memory problem).
Those are Tensor flops, the numbers for the Zen CPU are "general-purpose" flops (sometimes called "vector flops" in marketing material).
The vector flops for the 3090Ti are 33 TFlops for single precision, 0.5 TFlops for double precision. So, 16x faster than the 5950x in single precision, 2x slower for double precision. At almost 3x the price and >4x the power consumption.
Of course, if all you care about is AI, then there's no argument - but then we are not really talking about a general-purpose device any more.
The narrative of GPUs being "hundreds of times" faster than CPUs is vastly blown out of proportion for general-purpose computing.
I think you missed that this whole discussion is in the context of deep learning, therefore your comment does not apply. It is 30x slower than the 3090Ti for that purpose.
Here's the comment I assume you are trying to "correct":
> with full training you are out of luck with CPUs, the gap is much bigger. 64c TR could only get to roughly 1TFlops
1TFlops is not the main part of that statement, and it is qualified with "roughly" which I suppose is not too far from the truth in the context. And the context is "training ... the gap is much bigger", and in this case "much" is at least 30x even with the updated number.
https://github.com/Mysticial/Flops/
You can also get a theoretical computation of the Flops, which matches nicely with the experimental measurement. You have to take into account:
- the clock frequency (~3.9 GHz on multithreaded workloads on my machine)
- the number of cores (16)
- the reciprocal throughput of the FMA instruction (~0.5 cycles, that is, 2 instructions per clock cycle)
- the number of flops per instruction (2 for the FMA instruction, that is, 1 multiply + 1 add)
- the SIMD vector width (4 for double, 8 for float).
Putting it together:
3.9e9 * 16 * 2 * 2 * 4 = 998.4 GFlops (double)
3.9e9 * 16 * 2 * 2 * 8 = 1996.8 GFlops (single)
The measured values on my machine are a bit different, but close (1070 GFlops and 2151 GFlops respectively).
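For anyone who wants to plug in their own CPU, the calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `peak_gflops` and the constant names are made up for this example, and the inputs are the figures quoted above (3.9 GHz, 16 cores, 2 FMAs per cycle, 2 flops per FMA, 4 or 8 SIMD lanes).

```python
def peak_gflops(clock_hz, cores, fma_per_cycle, flops_per_fma, simd_lanes):
    """Theoretical peak in GFlops: product of the five factors, scaled by 1e9."""
    return clock_hz * cores * fma_per_cycle * flops_per_fma * simd_lanes / 1e9


# Figures quoted in the comment above (hypothetical names, for illustration only).
CLOCK_HZ = 3.9e9       # sustained multithreaded clock
CORES = 16
FMA_PER_CYCLE = 2      # reciprocal throughput of ~0.5 cycles
FLOPS_PER_FMA = 2      # 1 multiply + 1 add
LANES = {"double": 4, "single": 8}
MEASURED_GFLOPS = {"double": 1070, "single": 2151}

for precision, lanes in LANES.items():
    peak = peak_gflops(CLOCK_HZ, CORES, FMA_PER_CYCLE, FLOPS_PER_FMA, lanes)
    ratio = MEASURED_GFLOPS[precision] / peak
    print(f"{precision}: peak {peak:.1f} GFlops, measured/peak = {ratio:.3f}")
```

With these inputs the measured numbers come out roughly 7% above the computed peak, so the 3.9 GHz clock figure is probably the least exact of the five factors.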
References:
https://www.agner.org/optimize/instruction_tables.pdf
https://www.agner.org/forum/viewtopic.php?t=56
https://gadgetversus.com/processor/amd-ryzen-9-5950x-gflops-...