Hacker News
by execveat 1135 days ago
I get 3 tokens per second on an M1 Max running 30B models, compared to 1 token per second on a GPU (P40), both quantized to 4-bit. So in my opinion, CPUs are better for inference (at least fast CPUs with DDR5 versus the cheapest GPUs).

The reason GPUs seem to be the de facto standard is that they scale better, are more power efficient, and are better supported by PyTorch & co. Also, academia cares more about getting the best quality on its benchmarks than about performance and accessibility.

1 comment

GPUs win for training... And those who write papers and publish code tend to do lots of training and only a little inference.