by danielhanchen 508 days ago
That's a fair point - the trick with dynamic quants is that we selectively choose not to quantize many components all the way down - i.e. attention is left at 4 or 6 bit, and just the MoE parts are 1.5 bit (-1, 0, 1).

There are distilled versions like Qwen 1.5B, 7B, 14B, 32B and Llama 8B, 70B, but those are distillations, not R1 itself - if you want to run the original R1, then the quants are currently the only way.

But I agree quants do affect perf - hence the trick for MoEs is to not quantize specific areas!
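
To make the idea concrete, here's a rough sketch of what "quantize the MoE parts hard, leave attention alone" could look like. This is not the actual dynamic quant code: the tensor names, the `pick_bits` policy and the absmean ternary scheme are all assumptions picked for illustration, and real quant formats are block-wise and more involved.

```python
import math

def pick_bits(tensor_name: str) -> float:
    """Hypothetical policy: choose a bit width from the tensor's name."""
    if "experts" in tensor_name:   # MoE expert weights: quantize hardest
        return math.log2(3)        # ternary {-1, 0, 1} is ~1.58 bits
    if "attn" in tensor_name:      # attention stays at higher precision
        return 4.0
    return 6.0                     # everything else (assumed default)

def ternary_quantize(weights: list[float]) -> tuple[list[int], float]:
    """Map weights to {-1, 0, 1} with one shared scale (simple absmean scheme)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate weights from ternary codes."""
    return [c * scale for c in codes]

def average_bits(param_counts: dict[str, int]) -> float:
    """Parameter-weighted average bit width under the policy above."""
    total = sum(param_counts.values())
    return sum(pick_bits(name) * n for name, n in param_counts.items()) / total

# Toy example: made-up names and sizes, with experts holding most parameters.
codes, scale = ternary_quantize([0.9, -1.1, 0.05, 0.0])
avg = average_bits({"layer0.experts.w": 90, "layer0.attn.q": 10})
```

Since the experts hold most of the parameters in an MoE, the average bit width stays close to the expert bit width even though attention is kept at 4 bit.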