by quickthrower2
1139 days ago
I would love to see an article on why quantising to low bits works. It seems counterintuitive to me. For example, do that with a CD and it will sound awful. It took smarts to come up with the mp3 format rather than just reducing the number of bits.
|
Furthermore, model size is still the most significant contributor to output quality. E.g. vanilla llama-30b at 4-bit has better perplexity than any llama-13b finetune at 8-bit. Thus, if 4-bit lets you fit a larger model into available (V)RAM, you're still better off.
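To put rough numbers on the (V)RAM point, here is a back-of-the-envelope sketch. It counts weights only and assumes nominal parameter counts (13e9 and 30e9); it ignores quantisation metadata such as scales and zero-points, the KV cache, and activations, so real footprints will be somewhat higher. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Nominal parameter counts (assumption): "13b" = 13e9, "30b" = 30e9.
small_8bit = weight_memory_gb(13e9, 8)
large_4bit = weight_memory_gb(30e9, 4)

print(f"13B @ 8-bit: {small_8bit:.1f} GB")  # 13B @ 8-bit: 13.0 GB
print(f"30B @ 4-bit: {large_4bit:.1f} GB")  # 30B @ 4-bit: 15.0 GB
```

Under these assumptions the 4-bit 30b model needs only slightly more room than the 8-bit 13b one, which is why the larger model at lower precision is often the one that fits.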
This is also why analog computing is seriously considered as a hardware architecture for LLMs: if you don't actually need bit-perfect matmul for things to work well, it can be done much more simply as an analog circuit, and then you can cram a lot more of those circuits onto the same chip. Any resulting quality loss would presumably be minor, and in any case would be more than compensated for by the much larger model sizes such an architecture allows.
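As a toy illustration of "not bit-perfect" arithmetic, the sketch below perturbs every multiply in a dot product with independent Gaussian relative noise and compares the result with the exact one. This is only a stand-in for an imprecise multiplier, not a model of any real analog circuit; the 5% noise level, the vector length, and the name `noisy_dot` are all assumptions chosen for the example, and it says nothing about actual model quality.

```python
import math
import random

def noisy_dot(xs, ws, rel_noise, rng):
    """Dot product where each product is scaled by (1 + Gaussian noise).

    Toy stand-in for an imprecise multiplier; rel_noise is the standard
    deviation of the relative error on each individual multiply.
    """
    return sum(x * w * (1.0 + rng.gauss(0.0, rel_noise)) for x, w in zip(xs, ws))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 4096
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ws = [rng.gauss(0.0, 1.0) for _ in range(n)]

exact = sum(x * w for x, w in zip(xs, ws))
noisy = noisy_dot(xs, ws, rel_noise=0.05, rng=rng)

# Compare the error with the typical magnitude of such a dot product,
# which is about sqrt(n) for independent unit-variance inputs.
rel_err = abs(noisy - exact) / math.sqrt(n)
print(f"exact={exact:.3f} noisy={noisy:.3f} error/sqrt(n)={rel_err:.4f}")
```

Because the per-multiply errors are independent, they partly cancel: the total error grows like sqrt(n) rather than n, so relative to a typical output it stays around the per-multiply noise level instead of accumulating with vector length.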