by jairuhme
509 days ago
At my work, we self-host some models and have found that for anything remotely similar to RAG, or for very specific use cases, quantized models are more than sufficient. This lets us keep them running on smaller infrastructure, generally at lower cost.

Mistral's large 123B model works well (but slowly) at 4-bit quantisation, but if I knock it down to 2.5-bit quantisation for speed, output quality drops to the point where I'm better off with a 70B 4-bit model.
This makes me reluctant to evaluate new models in heavily quantised forms, as you're measuring the quantisation more than the actual model.
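
For a rough sense of why those three configurations get compared, here is a back-of-the-envelope sketch of weight storage only. It assumes size is simply parameters times bits per weight, and ignores KV cache, activations, and any per-block quantisation metadata, so real files will be somewhat larger. The helper name `weight_gb` is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Assumption: size = parameters * bits / 8, with no overhead for
    quantisation scales, KV cache, or activations.
    """
    return params_billion * bits_per_weight / 8


# The three setups mentioned in the comment above.
configs = {
    "123B @ 4-bit": (123, 4),
    "123B @ 2.5-bit": (123, 2.5),
    "70B @ 4-bit": (70, 4),
}

for name, (params_b, bits) in configs.items():
    print(f"{name}: ~{weight_gb(params_b, bits):.1f} GB of weights")
```

On this estimate the 2.5-bit 123B model and the 4-bit 70B model land in a similar memory range, which is why they end up as direct alternatives on the same hardware.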