by londons_explore 1141 days ago
CPU inference is only a little slower. GPUs aren't good for a batch size of 1 with everything quantised.

I get 3 tokens per second on an M1 Max running 30B models, compared to 1 token per second on a GPU (P40), both quantized to 4-bit. So, in my opinion, CPUs are better for inference (at least fast CPUs with DDR5 versus the cheapest GPUs).
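
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. It rests on an assumption that is not stated in the comment: at batch size 1, every generated token has to read all the weights from memory once, so memory bandwidth sets a ceiling on tokens per second. The function names and the 400 GB/s figure are illustrative placeholders, not measurements of any particular machine.

```python
def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in bytes, ignoring quantization metadata."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(n_params: float,
                          bits_per_weight: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-size-1 decoding speed.

    Assumption: each token requires one full pass over the weights,
    and nothing else (compute, cache, overhead) is the bottleneck.
    """
    return bandwidth_bytes_per_s / model_bytes(n_params, bits_per_weight)


# A 30B-parameter model at 4 bits per weight.
size = model_bytes(30e9, 4)

# Hypothetical bandwidth of 400 GB/s, chosen only for illustration.
ceiling = max_tokens_per_second(30e9, 4, 400e9)

print(f"weights: {size / 1e9:.0f} GB")
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

Under these assumptions the weights take about 15 GB, and the ceiling is far above the 1 to 3 tokens per second reported here, so real throughput is clearly limited by more than raw bandwidth (kernel quality for 4-bit formats, for instance).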

The reason GPUs seem to be the de facto standard is that they scale better, are more power efficient, and are better supported by PyTorch & co. Also, academia cares more about getting the best quality on its benchmarks than about performance and accessibility.

GPUs win for training... And those who write papers and publish code tend to do lots of training and only a little inference.