by regularfry 876 days ago
Mixtral and others are often distributed as 16-bit floats, so that chops the problem in half immediately, but then it turns out that LLMs only have about four bits per parameter of actual information stored. There's a lot of redundancy. The ideal quantisation scheme would only throw away useless data, but no quantisation scheme is perfect, so each one inevitably harms the model somehow.

You've then got to remember that one thing neural networks are very, very good at is being noise tolerant. In some senses that's all they are - noise correction systems. The inaccuracies introduced by quantisation are "just" a sort of noise, so it's not surprising that they aren't fatal. It just raises the noise floor and gives the model more ways to be wrong.

Finally, the thing to know is that these quantisation schemes don't do a naive "chop each number down to two bits". Simplifying a bit, for each parameter in this example they'd try to find a mapping from a two-bit index into a four-element lookup table of higher-precision values, such that the information destroyed by replacing the original parameter with the lookup value is minimised. That mapping is calculated across small blocks of parameters, rather than across the entire model, so it can preserve local detail. The lookup table gets stored per block, which throws the compression ratio off a little.
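
To make the lookup-table idea concrete, here's a rough sketch in plain Python. It is not how any particular library does it: the table is fitted with a simple 1-D k-means, and the block size of 256, the 16-bit table entries, and all the function names are assumptions picked for illustration.

```python
import random


def nearest(table, x):
    """Index of the table entry closest to x."""
    return min(range(len(table)), key=lambda k: (x - table[k]) ** 2)


def fit_block(block, levels=4, iters=10):
    """Fit a `levels`-entry lookup table to one block using 1-D k-means.

    Returns (table, indices). With levels=4 each index fits in two bits.
    """
    lo, hi = min(block), max(block)
    # Start from evenly spaced values across the block's own range.
    table = [lo + (hi - lo) * i / (levels - 1) for i in range(levels)]
    for _ in range(iters):
        idx = [nearest(table, x) for x in block]
        # Move each entry to the mean of the parameters assigned to it.
        for k in range(levels):
            members = [x for x, i in zip(block, idx) if i == k]
            if members:
                table[k] = sum(members) / len(members)
    return table, [nearest(table, x) for x in block]


def quantise(params, block_size=256):
    """Split params into blocks and fit a separate table for each one."""
    return [
        fit_block(params[start:start + block_size])
        for start in range(0, len(params), block_size)
    ]


def dequantise(blocks):
    """Replace every two-bit index with its block's lookup value."""
    return [table[i] for table, idx in blocks for i in idx]


def bits_per_param(block_size=256, index_bits=2, levels=4, table_bits=16):
    """Effective storage cost once the per-block table is counted."""
    return index_bits + levels * table_bits / block_size


# Made-up "weights": 1024 small Gaussian values, purely for demonstration.
random.seed(0)
params = [random.gauss(0.0, 0.02) for _ in range(1024)]
blocks = quantise(params)
restored = dequantise(blocks)

mse = sum((a - b) ** 2 for a, b in zip(params, restored)) / len(params)
print("mean squared error:", mse)
print("effective bits per parameter:", bits_per_param())
```

With those assumed numbers the table overhead works out to 4 × 16 / 256 = 0.25 extra bits per parameter, so the "two-bit" scheme really costs 2.25 bits. Smaller blocks track local detail better but pay more for their tables, which is the trade-off described above.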