Hacker News
by frognumber 1362 days ago
I recently ran some machine learning benchmarks on CPU versus GPU. The gap for a multicore CPU was much smaller than I expected. I, for one, am excited by a 64-core CPU in a way I wouldn't have been a year ago.

Between Hugging Face, Stable Diffusion, and Whisper, I'm running ML workloads a lot more. Being able to do so:

* with a standard instruction set

* with open-source software

* with my full system RAM

* without having to worry about what is in VRAM versus main RAM

is a big step up. I see about a 10x speed difference between an older 16-core CPU and a hot-off-the-press high-end Ampere card costing 3x as much as the CPU. If a 64-core CPU could bring that within 2x, or even 4x, I'd dump the GPU entirely.
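As a rough back-of-the-envelope check on that hope, here is a minimal sketch. It assumes throughput scales linearly with core count, which is optimistic (it ignores memory bandwidth, clock speed, and scheduling overhead), and that the 64-core part is otherwise comparable to the 16-core one. The function name `remaining_gap` is made up for illustration.

```python
def remaining_gap(gpu_speedup, old_cores, new_cores):
    """GPU-vs-CPU speed ratio left after moving to a CPU with more cores.

    Assumes perfect linear scaling with core count, which real
    workloads rarely achieve.
    """
    return gpu_speedup * old_cores / new_cores


# Numbers from the post: ~10x gap on a 16-core CPU, hoping for 64 cores.
gap = remaining_gap(gpu_speedup=10.0, old_cores=16, new_cores=64)
print(f"Ideal-case gap with 64 cores: {gap:.1f}x")
```

Under that ideal-scaling assumption the gap lands inside the 4x target but not the 2x one, so anything short of near-perfect scaling would matter.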

2 comments

Inference or training? I think with full training you are out of luck with CPUs; the gap is much bigger. A 64-core Threadripper could only get to roughly 1 TFlops.
Eh?

My 5950X measures ~2 TFLOPS in single precision and ~1 TFLOPS in double precision (as expected, since a SIMD vector holds half as many doubles as floats). This is a desktop-class 16-core machine.

Go and measure it yourself, if you have one :)

https://github.com/Mysticial/Flops/

You can also get a theoretical computation of the Flops, which matches nicely with the experimental measurement. You have to take into account:

- the clock frequency (~3.9 GHz on multithreaded workloads on my machine)

- the number of cores (16)

- the reciprocal throughput of the FMA instruction (~.5, that is, 2 instructions per clock cycle)

- the number of flops per instruction (2 for the FMA instruction, that is, 1 multiply + 1 add)

- the SIMD vector width (4 for double, 8 for float).

Putting it together:

3.9e9 * 16 * 2 * 2 * 4 = 998.4 GFlops (double)

3.9e9 * 16 * 2 * 2 * 8 = 1996.8 GFlops (single)

The measured values on my machine are a bit different, but close (1070 and 2151 respectively).
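The same calculation can be written as a small script, which makes it easy to plug in another CPU. This is only a sketch of the arithmetic above: the function and parameter names are made up for illustration, and the inputs are the figures quoted in this comment for the 5950X.

```python
def peak_gflops(clock_hz, cores, fma_per_cycle, flops_per_fma, simd_lanes):
    """Theoretical peak in GFlops: the product of the five factors above."""
    return clock_hz * cores * fma_per_cycle * flops_per_fma * simd_lanes / 1e9


# Figures quoted in the comment for a 5950X.
CLOCK_HZ = 3.9e9     # sustained multithreaded clock
CORES = 16
FMA_PER_CYCLE = 2    # reciprocal throughput of ~0.5
FLOPS_PER_FMA = 2    # 1 multiply + 1 add

double_peak = peak_gflops(CLOCK_HZ, CORES, FMA_PER_CYCLE, FLOPS_PER_FMA, simd_lanes=4)
single_peak = peak_gflops(CLOCK_HZ, CORES, FMA_PER_CYCLE, FLOPS_PER_FMA, simd_lanes=8)

# Measured values reported in the comment, in GFlops.
measured = {"double": 1070.0, "single": 2151.0}

for name, peak in (("double", double_peak), ("single", single_peak)):
    ratio = measured[name] / peak
    print(f"{name}: theoretical {peak:.1f} GFlops, measured/theoretical = {ratio:.2f}")
```

A measured value slightly above the theoretical one would be consistent with the cores running a bit faster than the 3.9 GHz assumed here, though that is a guess.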

References:

https://www.agner.org/optimize/instruction_tables.pdf

https://www.agner.org/forum/viewtopic.php?t=56

https://gadgetversus.com/processor/amd-ryzen-9-5950x-gflops-...

I've tried it on a 10980XE (18-core), which got between 600 GFlops and 1.6 TFlops depending on the instruction, in quad-channel mode. I'll try later on a 32-core Threadripper. The challenge there, I guess, is keeping all cores busy during training without repeating the same gradient computation (both scheduling and memory issues).
2 TFlops or 5 TFlops does not matter much. The 3090 Ti does 160 TFlops, i.e. at least 30x (!) faster.
Those are Tensor flops, the numbers for the Zen CPU are "general-purpose" flops (sometimes called "vector flops" in marketing material).

The vector flops for the 3090Ti are 33 TFlops for single precision, 0.5 TFlops for double precision. So, 16x faster than the 5950x in single precision, 2x slower for double precision. At almost 3x the price and >4x the power consumption.
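To make those ratios explicit, here is a small sketch using only the numbers quoted in this thread: 33 and 0.5 TFlops for the 3090 Ti, and roughly 2 and 1 TFlops for the 5950X. The dictionary and function names are illustrative, not from any library.

```python
# Vector (non-tensor) throughput in TFlops, as quoted in the thread.
GPU_3090TI = {"single": 33.0, "double": 0.5}
CPU_5950X = {"single": 2.0, "double": 1.0}


def gpu_over_cpu(precision):
    """Ratio of GPU to CPU throughput for the given precision."""
    return GPU_3090TI[precision] / CPU_5950X[precision]


for precision in ("single", "double"):
    ratio = gpu_over_cpu(precision)
    if ratio >= 1:
        print(f"{precision}: GPU is {ratio:.1f}x faster")
    else:
        print(f"{precision}: GPU is {1 / ratio:.1f}x slower")
```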

Of course, if all you care about is AI, then there's no argument - but then we are not really talking about a general-purpose device any more.

The narrative of GPUs being "hundreds of times" faster than CPUs is vastly blown out of proportion for general-purpose computing.

Zen 4 will do 16-bit bfloat16 floating point, so one would expect it to do a lot better than Threadripper on ML training applications?
This Threadripper is Zen 4.
Fair enough. My benchmark was inference. I care about inference much more than training for most of what I do.
Recently, I've been implementing my custom inference code in C for various models (GPT, Whisper) and am interested to see how it compares to various GPUs in terms of performance. So far, I've been running it only on my MacBook M1 as I don't have the necessary hardware.