| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by beck5 875 days ago
	Dumb question, but how can a 32 bit number be converted to 2 bits and still be useful? It seems like magic.

8 comments

regularfry 875 days ago

Mixtral and others are often distributed as 16-bit floats, so that chops the problem in half immediately, but then it turns out that LLMs only have about four bits per parameter of actual information stored. There's a lot of redundancy. The ideal quantisation scheme would only throw away useless data, but no quantisation scheme is perfect so they inevitably harm the model somehow.

You've then got to remember that one thing neural networks are very, very good at is being noise tolerant. In some senses that's all they are - noise correction systems. The inaccuracies introduced by quantisation are "just" a sort of noise, so it's not surprising that they aren't fatal. It just raises the noise floor and gives the model more ways to be wrong.

Finally the thing to know is that these quantisation schemes don't do a naive "chop each number down to two bits", not exactly. Simplifying a bit, for each parameter in this example they'd try to find a mapping from a two-bit index into a four element lookup table of higher-precision values such that the information destroyed by replacing the original parameter by the lookup value is minimised. That mapping is calculated across small blocks of parameters, rather than across the entire model, so it can preserve local detail. The lookup table gets stored per block, which throws the compression ratio off a little.

link

DougBTX 875 days ago

Nice graphs here: https://github.com/ggerganov/llama.cpp/pull/1684

So for example, 2 bit version of the 30B is much worse than the original, but still better than the 13B model.

Also, there are lots of extra details, eg, not all of the weights are 2 bit, and even the 2 bit weights are higher than that overall as groups of quantised weights share scale factors stored elsewhere.

link

beefield 875 days ago

I think of it with this kind of analogy: the original image is stored with 32 bit color scheme. You can reduce the color scheme to 16 bit accuracy and still figure out pretty well what the image is about. 2 bit is stretching this to a bit far, basically either pixel is white or it is black, but even if you lose lots of nuances in the image, in many images even that gives you some idea whats going on in the image.

link

DougBTX 875 days ago

That’s an interesting question, I wonder if there is an analogy in quantisation to image dithering?

link

hnfong 874 days ago

This blog post might shed some light on the matter. If I'm understanding it correctly, it claims there are emergent features on the LLM weights that make it easier to "compress" the floats into smaller bits without losing much precision.

https://timdettmers.com/2022/08/17/llm-int8-and-emergent-fea...

Note that 2 bit quantization is generally regarded as too aggressive. Generally 4bits+ achieves a good tradeoff, see eg. https://arxiv.org/abs/2212.09720

link

brucethemoose2 875 days ago

Its not really 2 bits.

Modern quantization schemes are almost like lossy compression algorithms, and llms in particular are very "sparse" and amenable to compression.

link

ttoinou 875 days ago

All the 32 bits weren't necessarily used, and it's the whole network itself that has to be useful. It's a tradeoff. We started with very good precision to test the new method, now we can optimize some parts of it

link

Const-me 874 days ago

Here’s an example of a custom 4 bits/weight codec for ML weights:

https://github.com/Const-me/Cgml/blob/master/Readme.md#bcml1...

llama.cpp does it slightly differently but still, AFAIK their quantized data formats are conceptually similar to my codec.

link

sp332 875 days ago

The extra precision is more useful for training. Once the network is optimized, it's a statistical model and only needs enough precision to make good guesses. In fact, one of the big papers on this also pointed out that you can drop about 40% of the weights completely. I think people generally skip that part because sparse matrix operations are slower, so it doesn’t help here.

link

viraptor 875 days ago

For models with dropped weights, the keyword is "distilled". For example ssd-1b is a 50% size version of Stable Diffusion XL (https://huggingface.co/segmind/SSD-1B)

link

sp332 875 days ago

That’s crazy, I’ve never seen one that dropped whole layers from a pre-trained model. I guess that avoids the sparse matrix math.

link