I get 3 tokens per second on an M1 Max running 30B models, compared to 1 token per second on a GPU (P40), both quantized to 4-bit. So, in my opinion, CPUs are better for inference (at least fast CPUs with DDR5 versus the cheapest GPUs).
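If you want to check numbers like these on your own hardware, here is a rough sketch of how tokens per second could be measured. It assumes your runtime exposes some way to stream tokens one at a time; `fake_generate` below is a made-up stand-in for that, not a real library call, so swap in whatever your inference tool provides.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(generate: Callable[[], Iterable[str]]) -> float:
    """Run one generation and return tokens produced per wall-clock second."""
    start = time.perf_counter()
    # Consume the stream and count how many tokens it yields.
    n_tokens = sum(1 for _ in generate())
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")


def fake_generate() -> Iterable[str]:
    """Hypothetical stand-in for a model: yields 6 tokens with a small delay."""
    for i in range(6):
        time.sleep(0.01)
        yield f"tok{i}"


if __name__ == "__main__":
    rate = tokens_per_second(fake_generate)
    print(f"{rate:.1f} tokens/s")
```

Run it with the same prompt, model, and quantization on each machine, and ideally average over a few runs, since a single short generation can be dominated by warm-up time.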
The reason GPUs seem to be the de facto standard is that they scale better, are more power efficient, and are better supported by PyTorch & co. Also, academia cares more about getting the best quality on its benchmarks than about performance and accessibility.