by michaelt
509 days ago
Personally I've noticed major changes in performance between different quantisations of the same model. Mistral's large 123B model works well (but slowly) at 4-bit quantisation, but if I knock it down to 2.5-bit quantisation for speed, performance drops to the point where I'm better off with a 70B 4-bit model. This makes me reluctant to evaluate new models in heavily quantised forms, as you're measuring the quantisation more than the actual model.
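For a rough sense of why those three configurations end up being compared, here is a back-of-the-envelope sketch of weight storage. It assumes size is simply parameters times bits per weight, and it ignores the KV cache, activations, and per-block quantisation overhead, so real files will be somewhat larger. The helper name `weight_gb` is made up for illustration.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: size = parameters * bits / 8, with no allowance for
    quantisation scales, KV cache, or runtime overhead.
    """
    return params_billions * bits_per_weight / 8


# The three configurations mentioned in the comment above.
sizes = {
    "123B @ 4-bit": weight_gb(123, 4),
    "123B @ 2.5-bit": weight_gb(123, 2.5),
    "70B @ 4-bit": weight_gb(70, 4),
}

for name, gb in sizes.items():
    print(f"{name}: ~{gb:.1f} GB")
```

Under these assumptions the 2.5-bit 123B model and the 4-bit 70B model land in a similar memory range, which is why the choice comes down to output quality rather than footprint.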
There are distilled versions (Qwen 1.5B, 7B, 14B, 32B and Llama 8B, 70B), but those are distilled models, not R1 itself. If you want to run the original R1, then the quants are currently the only way.
But I agree quants do affect perf - hence the trick for MoEs is to not quantize specific areas!
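To illustrate the idea of leaving some parts at higher precision, here is a minimal sketch of how the average bits per weight works out when a model is split into groups quantized differently. The 90/10 split and the 2-bit and 6-bit widths are hypothetical numbers chosen for the example, not the actual recipe used for any R1 quant, and `effective_bits` is an illustrative helper rather than part of any library.

```python
def effective_bits(parts: list[tuple[float, float]]) -> float:
    """Average bits per weight for a mixed-precision model.

    parts: (fraction_of_weights, bits_per_weight) pairs whose
    fractions must sum to 1.
    """
    total = sum(fraction for fraction, _ in parts)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in parts)


# Hypothetical split: most weights heavily quantized,
# a small sensitive portion kept at higher precision.
avg = effective_bits([(0.90, 2), (0.10, 6)])
print(f"average bits per weight: {avg:.2f}")
```

The point of the sketch is only that a small high-precision portion raises the average bit width modestly, so most of the size saving is kept.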