by WhitneyLand 149 days ago

It’s often pointed out in the first sentence of a comment that a model can be run at home, and then (maybe) toward the end of the comment it’s mentioned that it’s quantized.

Back when 4k movies needed expensive hardware, no one was saying they could play 4k on a home system, then later mentioning they actually scaled down the resolution to make it possible.

The degree of quality loss is not often characterized, which makes sense because it’s not easy to fully quantify quality loss with a few simple benchmarks.

By the time it’s quantized to 4 bits, 2 bits or whatever, does anyone really have an idea of how much they’ve gained vs just running a model that is sized more appropriately for their hardware, but not lobotomized?

5 comments

zozbot234 149 days ago

> ...Back when 4k movies needed expensive hardware, no one was saying they could play 4k on a home system, then later mentioning they actually scaled down the resolution to make it possible. ...

int4 quantization is the original release in this case; it's not been quantized after the fact. It's a bit of a nuisance when running on hardware that doesn't natively support the format (might waste some fraction of memory throughput on padding, specifically on NPU hw that can't do the unpacking on its own) but no one here is reducing quality to make the model fit.
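
To make the "unpacking" concrete, here is a minimal sketch, assuming two signed 4-bit values stored per byte with the low nibble first. The layout and the names `pack_int4` / `unpack_int4` are hypothetical, chosen for illustration; they are not the format of any particular model release or NPU.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad odd-length input with a zero nibble
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Unpack `count` signed 4-bit integers from bytes produced by pack_int4."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 represent -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


weights = [-8, 7, 3, -1, 0]
packed = pack_int4(weights)           # 5 values fit in 3 bytes
restored = unpack_int4(packed, len(weights))
```

Hardware without native int4 support has to do the equivalent of `unpack_int4` somewhere, or store each value in a wider type and spend memory bandwidth on the unused bits.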

WhitneyLand 149 days ago

Good point, thanks for the clarification.

The broader point remains, though: people say “you can run this model at home…” when the caveats are potentially substantial.

It would be so incredibly slow…

FuckButtons 149 days ago

From my own usage, the former is almost always better than the latter, because it’s less like a lobotomy and more like a hangover, though I have run some quantized models that still seem drunk.

Any model that I can run in 128 GB at full precision is far inferior to the models that I can just barely get to run after REAP + quantization for actually useful work.

I also read a paper a while back about improvements to model performance in contrastive learning when quantization was included during training as a form of perturbation, to try to force the model toward a smoother loss landscape. It made me wonder if something similar might work for LLMs, which I think might be what the people over at MiniMax are doing with M2.1, since they released it in FP8.

In principle, if the model has been effective during its learning at separating and compressing concepts into approximately orthogonal subspaces (and assuming the white box transformer architecture approximates what typical transformers do), quantization should really only impact outliers which are not well characterized during learning.

WhitneyLand 149 days ago

Interesting.

If this were the case, however, why would labs go to the trouble of distilling their smaller models rather than releasing quantized versions of the flagships?

petu 148 days ago

You can't quantize a 1T model down to "flash" model speed/token price. 4bpw is about the limit of reasonable quantization, so that's a 2-4x (fp8/16 -> 4bpw) weight size reduction. Easier to serve, sure, but maybe not cheap enough to offer as a free tier.

With distillation you're training a new model, so its size is arbitrary, say a 1T -> 20B (50x) reduction, and the result can also be quantized. AFAIK distillation is also simply faster/cheaper than training from scratch.
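
The ratios above can be checked with some back-of-the-envelope arithmetic. This is a rough sketch that counts only the weights themselves (params times bits per weight) and ignores quantization scales, activations, and KV cache; the helper name `weight_size_gb` is made up for illustration.

```python
def weight_size_gb(params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), ignoring scales and metadata."""
    return params * bits_per_weight / 8 / 1e9


flagship = 1e12   # the "1T" model from the comment
distilled = 20e9  # the "20B" distillation target from the comment

sizes = {
    "1T @ fp16": weight_size_gb(flagship, 16),
    "1T @ fp8": weight_size_gb(flagship, 8),
    "1T @ 4bpw": weight_size_gb(flagship, 4),
    "20B @ fp16": weight_size_gb(distilled, 16),
    "20B @ 4bpw": weight_size_gb(distilled, 4),
}
for name, gb in sizes.items():
    print(f"{name}: {gb:,.0f} GB")
```

Under these assumptions, quantizing the 1T model buys at most 4x, while distilling to 20B buys 50x before any quantization is applied on top.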

dabockster 149 days ago

Hanlon's razor.

"Never attribute to malice that which is adequately explained by stupidity."

Yes, I'm calling labs that don't distill smaller sized models stupid for not doing so.

codexon 149 days ago

Didn't this paper demonstrate that you only need 1.58 bits to be equivalent to 16 bits in performance?

https://arxiv.org/abs/2402.17764

Ey7NFZ3P0nzAe 149 days ago

This technique showed that there are ways during training to optimize weights so they quantize neatly while remaining performant. This isn't post-training quantization like int4.

WhitneyLand 148 days ago

For Kimi, quantization is also part of the training. Specifically, they say they use QAT (quantization-aware training).

That doesn't mean training with all-integer math, but certain tricks are used to specifically plan for the end weight size. That is, fake quantization nodes are inserted to simulate int4.
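
For anyone unfamiliar with the term, a "fake quantization" node is roughly a quantize-then-dequantize step in the forward pass. The sketch below is a generic illustration, assuming a symmetric per-tensor scale; it is not Kimi's actual recipe, and the name `fake_quantize_int4` is hypothetical. It covers only the forward pass; QAT setups typically let gradients flow through the rounding as if it were the identity (a straight-through estimator).

```python
def fake_quantize_int4(weights):
    """Quantize to a signed int4 grid and immediately dequantize.

    The result is still floating point, but every value sits on one of the
    16 levels that int4 can represent for this tensor's scale.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / 7  # symmetric per-tensor scale (an assumption)
    out = []
    for w in weights:
        q = max(-8, min(7, round(w / scale)))  # integer code in [-8, 7]
        out.append(q * scale)                  # back to float for the forward pass
    return out


simulated = fake_quantize_int4([3.5, -1.0, 1.3, 0.2])
```

Here 1.3 snaps to 1.5 and 0.2 snaps to 0.0, so the rest of the network sees the same rounding error during training that it will see with real int4 weights.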

WhitneyLand 149 days ago

IIRC the paper was solid, but it still hasn’t been adopted or proven out at large scale. It’s harder to adapt hardware and kernels to something like this than to int4.

RandomTeaParty 148 days ago

just call it one trit
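
For context, the "1.58 bits" figure is log2(3), the information content of one ternary value from {-1, 0, +1}. A small sketch of the arithmetic, plus one possible way to store trits in practice: five trits fit in a byte because 3**5 = 243 <= 256, which works out to 1.6 bits per trit. The base-3 layout and the names `pack_trits` / `unpack_trits` are assumptions for illustration, not the paper's storage format.

```python
import math

bits_per_trit = math.log2(3)  # information in one value from {-1, 0, +1}


def pack_trits(trits):
    """Pack five ternary values (-1, 0, +1) into one byte as a base-3 number."""
    assert len(trits) == 5
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)  # map -1, 0, +1 to digits 0, 1, 2
    return byte


def unpack_trits(byte):
    """Inverse of pack_trits: the first trit is the least significant digit."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits


example = [1, -1, 0, 0, 1]
encoded = pack_trits(example)
decoded = unpack_trits(encoded)
```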

Gracana 149 days ago

The level of deceit you're describing is kind of ridiculous. Anybody talking about their specific setup is going to be happy to tell you the model and quant they're running and the speeds they're getting, and if you want to understand the effects of quantization on model quality, it's really easy to spin up a GPU server instance and play around.

jasonjmcghee 149 days ago

> if you want to understand the effects of quantization on model quality, it's really easy to spin up a GPU server instance and play around

FWIW, not necessarily. I've noticed quantized models have strange and surprising failure modes: everything seems to be working well, and then the model goes into a death spiral repeating a specific word, or completely fails on one task out of a handful of similar tasks.

8-bit vs 4-bit can be almost imperceptible or night and day.

This isn't something you'd necessarily see while playing around, only when trying to do something specific.

selfhoster11 149 days ago

Except the parent comment said you can stream the weights from an SSD. The full weights, uncompressed. It takes a little longer (a lot longer), but the model at least works without lossy pre-processing.
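
To put a rough number on "a lot longer": if generation is bound by how fast weights can be read from disk, time per token is about bytes read per token divided by read bandwidth. The sketch below uses purely hypothetical placeholder figures (they are not measurements of any model or drive) and ignores caching and compute time.

```python
def seconds_per_token(bytes_read_per_token, read_bandwidth_bytes_per_s):
    """Lower bound on time per token when weight reads are the bottleneck."""
    return bytes_read_per_token / read_bandwidth_bytes_per_s


# Hypothetical placeholders: 16 GB of weights touched per token, 5 GB/s SSD reads.
estimate = seconds_per_token(16e9, 5e9)
print(f"~{estimate:.1f} s per token")
```

Plugging in your own model's per-token weight traffic and your drive's sustained read speed gives a quick sanity check before trying it.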
