by brucethemoose2
1048 days ago
exLlama is not the only viable quantized backend. TVM (as used by mlc-llm) and GGML (which is used by llama.cpp) are very strong contenders. ~7B-13B models will work in 16GB of RAM with pretty much any dGPU to help, as will context-extending tricks. TBH I suspect Stability released a 3B model because it's cheap and quick to train. If they really wanted a good model on modest devices, they would have reused a supported architecture (like Falcon, MPT, Llama, Starcoder...) or contributed support to a good backend. Also, I think any PyTorch-based model is not really viable for consumer use. It's just too finicky to install and too narrow in its hardware support.
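To put rough numbers on the 16GB claim, here is a back-of-the-envelope sketch. It assumes weight memory is simply parameter count times bits per weight. The function name and the 4.5 bits/weight figure (a guess at the overhead of quantization scales) are illustrative assumptions, not taken from any backend, and the estimate ignores the KV cache and activations entirely.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes.

    Hypothetical helper: ignores KV cache, activations, and runtime overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Compare plain 4-bit, an assumed ~4.5 bits/weight with scale overhead, and fp16.
for n_params in (7e9, 13e9):
    for bits in (4.0, 4.5, 16.0):
        size_gb = quantized_weight_gb(n_params, bits)
        print(f"{n_params / 1e9:.0f}B @ {bits} bits/weight: {size_gb:.1f} GB")
```

Under these assumptions the 4-bit weights for both sizes land well under 16GB, while fp16 13B does not, which is why quantization matters here. Whatever is left over has to cover context and everything else running on the machine.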