by FeepingCreature
561 days ago
I think all the people saying "just use a CPU" massively underestimate the speed difference between current CPUs and current GPUs. It's well over an order of magnitude; it's not even in the same zip code. Say you have a 64-core CPU at 2 GHz with 512-bit, single-cycle FP16 instructions. That gives you 32 ops per cycle per core, 2048 across the entire package, so about 4 TFLOPS. My 7900 XTX does 120 TFLOPS, roughly 30x more. To match that, you would need to scale that CPU up to either 2048 cores, 2 KB per register (still single-cycle!), or 64 GHz. I guess with 1024-bit registers at 8 GHz, you could get away with only 240 cores. Good luck dissipating the heat from that, btw. To reverse an opinion I'm seeing in this thread: at that point, your CPU starts looking more like a GPU by necessity.
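The arithmetic above can be reproduced with a quick back-of-the-envelope calculator. This is a sketch that assumes, as the comment does, one vector instruction per core per cycle and one FP16 op per lane. `peak_tflops` is a hypothetical helper, and the 120 TFLOPS GPU figure is simply the one quoted above.

```python
def peak_tflops(cores, ghz, vector_bits, elem_bits=16, ops_per_lane=1):
    """Peak throughput in TFLOPS for a hypothetical SIMD CPU.

    Assumes one vector instruction retires per core per cycle. Set
    ops_per_lane=2 to count a fused multiply-add as two FLOPs.
    """
    lanes = vector_bits // elem_bits              # FP16 lanes per register
    ops_per_cycle = cores * lanes * ops_per_lane  # across the whole package
    return ops_per_cycle * ghz * 1e9 / 1e12


GPU_TFLOPS = 120.0  # the 7900 XTX figure quoted in the comment

baseline = peak_tflops(cores=64, ghz=2, vector_bits=512)
print(f"baseline CPU: {baseline:.3f} TFLOPS, gap: {GPU_TFLOPS / baseline:.1f}x")

# Each variant scales one knob (or, in the last case, three) until the CPU catches up.
variants = {
    "2048 cores": peak_tflops(cores=2048, ghz=2, vector_bits=512),
    "2 KB registers": peak_tflops(cores=64, ghz=2, vector_bits=16384),
    "64 GHz": peak_tflops(cores=64, ghz=64, vector_bits=512),
    "240 cores, 1024-bit, 8 GHz": peak_tflops(cores=240, ghz=8, vector_bits=1024),
}
for name, tflops in variants.items():
    print(f"{name}: {tflops:.1f} TFLOPS")
```

The baseline comes out at about 4.1 TFLOPS, roughly 29x short of 120, and every scaled variant lands at or just above 120 TFLOPS, which matches the comment's numbers.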
For inference, prompt processing is compute-intensive, while token generation is memory-bandwidth bound. The difference in memory bandwidth between CPUs and GPUs tends to be even more pronounced than the difference in compute.
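To see why the two phases behave differently, here is a rough roofline-style estimate of one forward pass at batch size 1. It is a sketch under loud assumptions: weights are streamed from memory once per pass, each token costs about 2 FLOPs per parameter, and KV-cache and activation traffic are ignored. `step_time_s` is a hypothetical helper, and the 7B-parameter model and 100 GB/s bandwidth in the example are illustrative values, not measurements.

```python
def step_time_s(params, tokens, tflops, bandwidth_gb_s, bytes_per_param=2):
    """Roofline estimate for one forward pass over `tokens` tokens (batch size 1).

    Assumptions: all weights are read from memory once per pass, each token
    costs ~2 FLOPs per parameter, and KV-cache/activation traffic is ignored.
    Returns (seconds, limiting resource).
    """
    compute_s = 2 * params * tokens / (tflops * 1e12)
    memory_s = params * bytes_per_param / (bandwidth_gb_s * 1e9)
    if compute_s > memory_s:
        return compute_s, "compute"
    return memory_s, "memory"


# Hypothetical CPU from the comment (4 TFLOPS) with an assumed 100 GB/s of
# memory bandwidth, running an illustrative 7B-parameter FP16 model.
PARAMS = 7e9
prefill_s, prefill_bound = step_time_s(PARAMS, tokens=512, tflops=4, bandwidth_gb_s=100)
decode_s, decode_bound = step_time_s(PARAMS, tokens=1, tflops=4, bandwidth_gb_s=100)

print(f"prompt processing (512 tokens): {prefill_s:.3f} s, {prefill_bound}-bound")
print(f"token generation (1 token):     {decode_s:.3f} s, {decode_bound}-bound")
```

Generating one token barely touches the compute budget but still has to stream all 14 GB of weights, so it is memory-bound; processing a 512-token prompt reuses each weight 512 times and becomes compute-bound. Under these assumptions, raising `bandwidth_gb_s` speeds up generation almost linearly, while raising `tflops` mostly helps prompt processing.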