Hacker News new | ask | show | jobs
by celrod 11 hours ago
What quant do you run it at? 32GB seems like cutting it close on the rtx 5090 if going 8b, but other commenters are saying 4b lobotomizes the model.
1 comments

As a baseline, I run all models in Q8 [0] because I want to be confident that when I observe a problem, the root cause is not due to the quantization. However, in this specific case, I use Q8 on the mac and Q4 on the RTX machine because the latter does not fit the full context at Q8. So far, I don't have conclusive evidence that the Q4 quantization affects the quality in a significant way for this model and the tasks that I am using it for.

[0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...

27B seems surprisingly resiliant to quantisation. Though my evals showed there was some impact to coding ability from 8 bit to 4 bit, it was less than I would've expected: and it was on task types that you've said above that you don't really do with these!