by brucethemoose2 1048 days ago
ExLlama is not the only viable quantized backend. TVM (as used by mlc-llm) and GGML (which is used by llama.cpp) are very strong contenders.

~7B-13B models will run in 16GB of RAM with pretty much any dGPU helping out, even with context-extending tricks.
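
Rough back-of-envelope for that claim, as a sketch only: it counts the weights alone and assumes a flat bits-per-weight figure. Real quantized formats spend a bit more on scales, and the KV cache and runtime overhead come on top. `weight_gib` is just a helper name made up for this example.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Compare unquantized fp16 against 8-bit and 4-bit quantization.
for size in (7, 13):
    for bits in (16, 8, 4):
        print(f"{size}B @ {bits}-bit: ~{weight_gib(size, bits):.1f} GiB")
```

By this estimate a 13B model at 4 bits is around 6 GiB of weights, which leaves headroom in 16GB, while the same model at fp16 would not fit.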

TBH I suspect Stability released a 3B model because it's cheap and quick to train. If they really wanted a good model on modest devices, they would have reused a supported architecture (like Falcon, MPT, Llama, Starcoder...) or contributed support to a good backend.

*Also, I think any PyTorch-based model is not really viable for consumer use. It's just too finicky to install and too narrow in its hardware support.