HN Mirror

by BoorishBears 462 days ago

I do? I spend a ton of time post-training models for creative tasks.

The effects of model quantization are usually quantified in terms of performance on benchmaxxed tasks with strong logit probabilities, temp 0, and a "right" answer the model has to pick. Or, even worse, they're measured on metrics like perplexity that don't map to anything except themselves (https://arxiv.org/pdf/2407.09141).

I agree Q8 is strong, but I also think the effects of quantization are constantly underappreciated. People often talk about how "a model" performs while actually using one of 10+ quantized variants of it, each with a distinct performance profile.

Even knowing the bits per weight used isn't enough to know how exactly a given quant method is affecting the model: https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs
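To make the bits-per-weight point concrete, here is a minimal sketch, not any real quant format: the same synthetic weights are rounded to 4-bit integers twice, once with a single scale for the whole tensor and once with one scale per block of 32. The helper names, the block size, and the synthetic weights with a few injected outliers are all assumptions for illustration.

```python
import random

def quantize_absmax(values, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0
    # Quantize to an integer code, clamp, then dequantize back to a float.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def quantize_blockwise(values, bits=4, block=32):
    """Same integer width, but each block of weights gets its own scale."""
    out = []
    for i in range(0, len(values), block):
        out.extend(quantize_absmax(values[i:i + block], bits))
    return out

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic weights (an assumption): mostly small values plus a few outliers.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]
for i in (5, 300, 600, 900):
    weights[i] = 1.0

mse_tensor = mse(weights, quantize_absmax(weights))
mse_block = mse(weights, quantize_blockwise(weights))
print(f"per-tensor 4-bit MSE: {mse_tensor:.2e}")
print(f"per-block  4-bit MSE: {mse_block:.2e}")
```

Both variants store 4-bit codes, yet the outliers wreck the per-tensor scale while only hurting the blocks they sit in. The per-block variant does pay for extra scales, so its effective bits per weight is slightly higher, which is one more reason the headline number alone doesn't describe what happened to the model.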

2 comments

imtringued 462 days ago

If you had trained your own models, you would be aware of quantization-aware training.

danielmarkbruce 462 days ago

"Nobody really cares if it meets a strict definition of lossless" != "quantization can be done haphazardly."

BoorishBears 462 days ago

If you're trying to really snarkily refer to the article on Dynamic Quants 2.0 and how carefully developed they were, they're comparing their quants to the methodology 99.99% of quants out there use.

The problem is not that people are making quants "haphazardly", it's that people keep parroting that various quants are "practically lossless" when they actually have absolutely no clue how lossy they are given how application specific the concept is for something as multidimensional as an LLM.

The moment anyone tries a little harder to quantify how lossy they are, we repeatedly find that the answer is "not by any reasonable definition of lossless". Even their example, where Q4 is <1% away in MMLU 5-shot, is probably massively helped by a calibration dataset that maps to MMLU-style tasks really well, just like constantly using WikiText massively helps models that were trained on... tons of text from Wikipedia.

So unless you're doing your own calibrated quantization with your own dataset (which is not impossible, but also not nearly common), even their "non-haphazard" method could have a noticeable impact on performance.

danielmarkbruce 462 days ago

Wasn't referring to that.

You are saying that people are using quantized models haphazardly and talking about them haphazardly. I'll grant it's not the exact same thing as making them haphazardly, but I think you took the point.

The terms shouldn't be used here. They aren't helpful. You are either getting good results or you are not. It shouldn't be treated differently from further training on dataset d. The weights changed - how much better or worse at task Y did it just get?
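A rough sketch of what "how much better or worse at task Y did it just get?" can look like in practice: run both sets of weights over the same examples and compare per-example outcomes, not just the aggregate. `compare_on_task`, the stub models, and the toy examples are hypothetical stand-ins for a real eval set and real inference calls.

```python
from typing import Callable, Dict, Sequence, Tuple

def compare_on_task(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
) -> Dict[str, float]:
    """Score two models on the same (prompt, expected answer) pairs."""
    base_ok = [baseline(prompt) == answer for prompt, answer in examples]
    cand_ok = [candidate(prompt) == answer for prompt, answer in examples]
    n = len(examples)
    return {
        "baseline_acc": sum(base_ok) / n,
        "candidate_acc": sum(cand_ok) / n,
        # Examples the baseline got right and the candidate got wrong.
        "regressions": sum(b and not c for b, c in zip(base_ok, cand_ok)),
        # Examples the baseline got wrong and the candidate got right.
        "fixes": sum(c and not b for b, c in zip(base_ok, cand_ok)),
    }

# Toy stand-ins (assumptions): dict lookups play the role of model calls.
examples = [("2+2", "4"), ("3+3", "6"), ("5+5", "10"), ("1+1", "2")]
baseline_answers = {"2+2": "4", "3+3": "6", "5+5": "10", "1+1": "3"}
candidate_answers = {"2+2": "4", "3+3": "6", "5+5": "11", "1+1": "2"}

report = compare_on_task(baseline_answers.get, candidate_answers.get, examples)
print(report)
```

In this toy case both models score 0.75, but one answer regressed and one got fixed, so the aggregate number alone would hide that the behavior changed.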

BoorishBears 462 days ago

The term is perfectly fine to use here because choosing a quantization strategy to deploy already has enough variables:

- quality for your specific application

- time to first token

- inter-token latency

- memory usage (varies even for a given bits per weight)

- generation of hardware required to run

Of those the hardest to measure is consistently "quality for your specific application".
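As a sketch of how those variables interact: the operational ones can be treated as hard constraints, leaving quality as the thing you maximize. Everything here is hypothetical, including the `Candidate` fields, the `pick` helper, and especially the numbers, which are placeholders rather than measurements of any real model.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float      # score on YOUR application's eval, higher is better
    ttft_ms: float      # time to first token
    itl_ms: float       # inter-token latency
    memory_gb: float    # measured footprint, not derived from bits per weight

def pick(
    candidates: Iterable[Candidate],
    max_memory_gb: float,
    max_ttft_ms: float,
    max_itl_ms: float,
) -> Optional[Candidate]:
    """Return the highest-quality candidate that fits every hard limit."""
    feasible = [
        c for c in candidates
        if c.memory_gb <= max_memory_gb
        and c.ttft_ms <= max_ttft_ms
        and c.itl_ms <= max_itl_ms
    ]
    return max(feasible, key=lambda c: c.quality, default=None)

# Placeholder numbers for illustration only.
options = [
    Candidate("fp16", quality=0.90, ttft_ms=900, itl_ms=60, memory_gb=140),
    Candidate("q8", quality=0.89, ttft_ms=600, itl_ms=40, memory_gb=75),
    Candidate("q4", quality=0.84, ttft_ms=400, itl_ms=25, memory_gb=40),
]

choice = pick(options, max_memory_gb=80, max_ttft_ms=700, max_itl_ms=50)
print(choice.name if choice else "nothing fits")
```

The latency and memory fields are cheap to fill in with a benchmark run. The whole selection is only as trustworthy as the `quality` column.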

It's so hard to measure robustly that many will take significantly worse performance on the other fronts just to not have to try to measure it... which is how you end up with full precision deployments of a 405b parameter model: https://openrouter.ai/meta-llama/llama-3.1-405b-instruct/pro...

When people are paying multiples more for compute to side-step a problem, language and technology that allow you to erase it from the equation are valid.

danielmarkbruce 462 days ago

You say that as though people know these things for the full precision deployment and their use case.

Some have the capability to figure it out and can do it for both full precision and quantized. Most don't and cannot.
