by moffkalast 1141 days ago
Well, to run the average model as-is, without spending a few days figuring out why you're getting strange errors and can't get it working, you more or less need CUDA support.

As much VRAM as you can get is probably also a good idea.

For reference, I can seemingly run Vicuna-7B (I think the 4-bit version) on my 6 GB 1660 Ti at roughly 1.5 tokens per second. That's way too slow for anything useful, so you can imagine what CPU inference would look like.

1 comment

CPU inference is only a little slower. GPUs aren't good at a batch size of 1 with everything quantized.
I get 3 tokens per second on an M1 Max running 30B models, compared to 1 token per second on a GPU (P40), both quantized to 4-bit. So, in my opinion, CPUs are better for inference (at least fast CPUs with DDR5 versus the cheapest GPUs).

The reason GPUs seem to be the de facto standard is that they scale better, are more power efficient, and are better supported by PyTorch & co. Also, academia cares more about getting the best quality on its benchmarks than about performance and accessibility.

GPUs win for training... And those who write papers and publish code tend to do lots of training and only a little inference.
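
To put the numbers quoted in this thread in perspective, here is a rough back-of-the-envelope sketch in Python. It estimates how much memory the weights alone take at 4 bits per weight, and how long a reply would take at the quoted speeds. The function names and the 200-token reply length are made up for illustration, and the size estimate ignores the KV cache, activations, and quantization overhead, so treat the results as a lower bound rather than a measurement.

```python
def weight_gib(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GiB.

    Assumption: every parameter is stored at exactly bits_per_weight bits.
    Real quantized files are somewhat larger (scales, unquantized layers).
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def seconds_for_reply(tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply of the given length at a steady rate."""
    return tokens / tokens_per_second


REPLY_TOKENS = 200  # hypothetical reply length, not from the thread

# (label, parameters in billions, tokens per second quoted in the thread)
QUOTED = [
    ("7B 4-bit on 1660 Ti", 7, 1.5),
    ("30B 4-bit on M1 Max", 30, 3.0),
    ("30B 4-bit on P40", 30, 1.0),
]

if __name__ == "__main__":
    for label, params, tps in QUOTED:
        size = weight_gib(params, 4)
        wait = seconds_for_reply(REPLY_TOKENS, tps)
        print(f"{label}: ~{size:.1f} GiB of weights, "
              f"~{wait:.0f} s for {REPLY_TOKENS} tokens")
```

Under these assumptions a 4-bit 7B model needs a bit over 3 GiB for weights, which is why it fits on a 6 GB card at all, and a 200-token reply at 1.5 tokens per second takes over two minutes.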