HN Mirror

by FeepingCreature 561 days ago

I think all the people saying "just use a CPU" massively underestimate the speed difference between current CPUs and current GPUs. There's like four orders of magnitude. It's not even in the same zip code. Say you have a 64-core CPU at 2 GHz with 512-bit, 1-cycle FP16 instructions. That gives you 32 ops per cycle per core (512 / 16), 2048 across the entire package, so about 4 TFLOPS.

My 7900 XTX does 120 TFLOPS.

To match that, you would need to scale that CPU up to either 2048 cores, 2 KB per register (still one-cycle!), or 64 GHz.

I guess if you had 1024-bit registers and 8 GHz, you could get away with only 240 cores. Good luck dissipating that heat, btw. To reverse an opinion I'm seeing in this thread, at that point your CPU starts looking more like a GPU by necessity.
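
The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-envelope sketch: `peak_flops` is a hypothetical helper, and it assumes every core retires one full-width FP16 vector op per cycle, as the comment does. The raw gap works out to roughly 29x, which the comment rounds up to 32x to get 2048 cores, 2 KB registers, or 64 GHz.

```python
def peak_flops(cores, clock_hz, vector_bits, elem_bits=16):
    """Peak ops/s if every core retires one full-width vector op per cycle."""
    lanes = vector_bits // elem_bits  # 512-bit registers / FP16 -> 32 lanes
    return cores * clock_hz * lanes

GPU_FLOPS = 120e12  # the 7900 XTX figure quoted in the comment

# The hypothetical CPU from the comment: 64 cores, 2 GHz, 512-bit vectors.
baseline = peak_flops(cores=64, clock_hz=2e9, vector_bits=512)
gap = GPU_FLOPS / baseline

# Close the gap by scaling exactly one knob at a time.
cores_needed = 64 * gap
clock_needed_ghz = 2 * gap
register_bytes_needed = (512 / 8) * gap

# Or combine 1024-bit registers with an 8 GHz clock and solve for cores.
cores_at_1024bit_8ghz = GPU_FLOPS / peak_flops(1, 8e9, 1024)

print(f"baseline: {baseline / 1e12:.2f} TFLOPS, gap to GPU: {gap:.1f}x")
print(f"cores only: {cores_needed:.0f}")
print(f"clock only: {clock_needed_ghz:.1f} GHz")
print(f"register width only: {register_bytes_needed:.0f} bytes")
print(f"1024-bit @ 8 GHz: {cores_at_1024bit_8ghz:.0f} cores")
```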

ryao 561 days ago

Usually, you can do 2 AVX-512 operations per cycle, and by using FMADD (fused multiply-add) instructions, you can do two floating point operations for the price of one. That would be 128 FP16 operations per cycle per core (32 lanes × 2 instructions × 2 operations). The result would be 16 TFLOPS on a 2 GHz 64-core CPU, not 4 TFLOPS. This would give a difference of roughly 1 order of magnitude, rather than 4 orders of magnitude.

For inference, prompt processing is compute intensive, while token generation is memory bandwidth bound. The differences in memory bandwidth between CPUs and GPUs tend to be more profound than the difference in compute.

FeepingCreature 561 days ago

That's fair. On the other hand, there's like exactly one CPU with FP16 AVX-512 anyway, and 64-core parts aren't exactly commonplace either. And even with all those advantages, using a datacenter CPU, you're still a factor of 10 off from a GPU that isn't even consumer top-end. With a normal processor, say 16 cores, 16 float ops, even with fused ops and dispatching two ops per cycle you're still only at 2 TFLOPS and ~50x. In consumer spaces, I'm more optimistic about dedicated coprocessors. Maybe even iTPU?
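
Both estimates in this exchange follow from the same formula, sketched below. `cpu_peak_flops` is a hypothetical helper, and the 2 GHz clock for the 16-core case is an assumption carried over from the earlier example, since the comment does not state one. With these inputs the gaps come out at about 7x and about 59x, in the same ballpark as the "factor of 10" and "~50x" quoted above.

```python
def cpu_peak_flops(cores, clock_hz, lanes, vector_ops_per_cycle=2, flops_per_op=2):
    """Peak FLOP/s with dual-issue vector units and FMA counted as 2 FLOPs."""
    return cores * clock_hz * lanes * vector_ops_per_cycle * flops_per_op

GPU_FLOPS = 120e12  # the 7900 XTX figure from the top-level comment

# Datacenter case: 64 cores, 512-bit vectors of FP16 -> 32 lanes.
datacenter = cpu_peak_flops(cores=64, clock_hz=2e9, lanes=32)

# "Normal processor" case: 16 cores, 16 floats per vector op.
# The 2 GHz clock is an assumption; the comment does not give one.
desktop = cpu_peak_flops(cores=16, clock_hz=2e9, lanes=16)

for name, flops in [("datacenter", datacenter), ("desktop", desktop)]:
    print(f"{name}: {flops / 1e12:.2f} TFLOPS, GPU is {GPU_FLOPS / flops:.1f}x faster")
```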

ryao 561 days ago

Zen 6 is supposed to add FP16 AVX512 support if AMD’s leaked slides mean what I think they mean. Here is a link to a screenshot of the leaked slides MLID published:

https://overclockers.ru/st/legacy/blog/428111/424644_O.jpg

I have been working on doing inference on a Ryzen 7 5800X lately and I have had good results:

https://github.com/ryao/llama3.c/blob/master/run.c

Running on a GPU like my 3090 Ti will likely outperform it by two orders of magnitude, but I have managed to push the needle slightly on state-of-the-art prompt processing performance on my CPU. I suspect an additional 15% improvement is possible, but I do not expect to be able to realize it. In any case, it is an active R&D project that I am doing to learn how these things work.

Finally, to answer your question, I have no good answers for you (or more specifically, no answers that I like). I have been trying for months to think of ways to do fast local inference on high end models cost effectively. So far, I have nothing to show for it aside from my R&D into CPU llama 3 inference, since none of my ideas bring the hardware cost needed for llama 3.1 405B below $10,000 at an acceptable performance level. My idea of an acceptable performance level is 10 tokens per second for token generation and 4000 tokens per second for prompt processing, although perhaps lower prompt processing performance is acceptable with prompt caching.
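
To get a feel for why that target is hard, here is a rough sketch of the memory bandwidth that 10 tokens per second would imply. It rests on simplifying assumptions that are not from the comment: a dense model where every weight is read once per generated token, batch size 1, and no KV cache traffic. `required_bandwidth` and the list of weight formats are illustrative only.

```python
PARAMS = 405e9            # llama 3.1 405B parameter count
TARGET_TOKENS_PER_S = 10  # the token generation target from the comment

def required_bandwidth(params, bytes_per_param, tokens_per_s):
    """Bytes/s needed if every weight is streamed once per generated token."""
    return params * bytes_per_param * tokens_per_s

# Assumed weight formats; the comment does not say which one is intended.
for name, bytes_per_param in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    bw = required_bandwidth(PARAMS, bytes_per_param, TARGET_TOKENS_PER_S)
    print(f"{name}: {weights_gb:.1f} GB of weights, {bw / 1e12:.2f} TB/s needed")
```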

imtringued 561 days ago

This is only relevant for the flash attention part of the transformer, but an NPU is an equally suitable replacement for a GPU for flash attention.

Once you have offloaded flash attention, you're back to GEMV having a memory bottleneck. GEMV does a single multiplication and addition per parameter. You can add as many exaFLOPS as you want; it won't get faster than your memory.
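
A simple roofline-style sketch makes the point concrete. GEMV performs 2 FLOPs per parameter, and each parameter has to be fetched from memory once, so attainable throughput is capped by bandwidth times FLOPs per byte. The 1 TB/s bandwidth below is a hypothetical round number, not a measurement of any real device, and `gemv_attainable_flops` is an illustrative helper.

```python
def gemv_attainable_flops(peak_flops, mem_bandwidth_bytes_s, bytes_per_param):
    """Roofline bound for GEMV: 2 FLOPs (multiply + add) per parameter read."""
    intensity = 2.0 / bytes_per_param  # FLOPs per byte of weights moved
    return min(peak_flops, mem_bandwidth_bytes_s * intensity)

BANDWIDTH = 1e12  # hypothetical 1 TB/s memory system, for illustration only

# Raising peak compute from 16 TFLOPS to an exaFLOP changes nothing here.
for peak in (16e12, 120e12, 1e18):
    attainable = gemv_attainable_flops(peak, BANDWIDTH, bytes_per_param=2.0)
    print(f"peak {peak:.0e} FLOP/s -> attainable {attainable:.0e} FLOP/s")
```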

FeepingCreature 561 days ago

Out of interest, how does that look for diffusion?
