by apitman 294 days ago
If you want to run small models fast, get the 5090. If you want to run large models slow, get the Spark. If you want to run small models slow, get a used MI50. If you want to run large models fast, get a lot more money.
1 comment

You might be able to do "large models slow" better than the Spark with a 5090 and CPU offload, as long as you stick with MoE architectures. With the KV cache and the shared parts of the model on the GPU and all of the experts on the CPU, it can work pretty well. I'm able to run ~400GB models at 10 tokens/s with some A4000s and a bunch of RAM. That's on a Xeon W system with poor practical memory bandwidth (~190GB/s); you can do better with EPYC.
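A rough way to sanity-check numbers like these is a back-of-envelope bandwidth estimate. The sketch below assumes decode speed is limited purely by how fast the active expert weights can be streamed from system RAM, ignoring compute, the GPU-resident parts, and PCIe transfers. The function names are made up for illustration, and the 400 GB/s figure is a hypothetical value, not a measured EPYC number.

```python
def active_gb_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """GB of weights read from RAM per token, implied by an observed speed.

    Assumes decode is purely memory-bandwidth bound (a simplification).
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return bandwidth_gb_s / tokens_per_s


def max_tokens_per_s(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Upper bound on decode speed for a given bandwidth and per-token read size."""
    if gb_per_token <= 0:
        raise ValueError("gb_per_token must be positive")
    return bandwidth_gb_s / gb_per_token


# Figures quoted in the comment: ~190 GB/s practical bandwidth, 10 tokens/s.
implied_gb = active_gb_per_token(190.0, 10.0)

# Hypothetical faster system (400 GB/s is an assumed value for illustration).
faster = max_tokens_per_s(400.0, implied_gb)

print(f"implied read per token: {implied_gb:.1f} GB")
print(f"bound at 400 GB/s:      {faster:.1f} tokens/s")
```

Under that assumption, the quoted 190 GB/s and 10 tokens/s work out to roughly 19 GB of weights streamed from RAM per token, which is why only the active experts of an MoE model need to be read each step rather than the full ~400GB.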