Hacker News
by int_19h 1139 days ago
A very broad answer is that large NNs are surprisingly resilient to numerical inaccuracies, and this resilience seems to become more pronounced as model size grows. This is readily observable with LLaMA, where 4-bit quantization hurts the 7B model the most of all the sizes.
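To make "inaccuracies" concrete, here is a rough sketch of what quantization does to a list of weights. It uses plain symmetric round-to-nearest with one scale for the whole list, which is an assumption chosen for simplicity and not the scheme any particular LLaMA tooling uses. The weights are random toy data, and `quantize_dequantize` and `rms_error` are made-up names.

```python
import math
import random

def quantize_dequantize(weights, bits=4):
    """Round each weight to the nearest of a few evenly spaced levels.

    Symmetric round-to-nearest with a single scale for the whole list.
    Illustrative only; real quantizers are more elaborate.
    """
    qmax = 2 ** (bits - 1) - 1                   # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = [round(w / scale) * scale for w in weights]
    return restored, scale

def rms_error(original, restored):
    """Root-mean-square difference between two equal-length lists."""
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)
    )

rng = random.Random(0)                           # fixed seed, toy data
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]

for bits in (8, 4):
    restored, scale = quantize_dequantize(weights, bits)
    err = rms_error(weights, restored)
    print(f"{bits}-bit: step={scale:.4f}  rms error={err:.4f}")
```

Each restored weight is off by at most half a step, so going from 8-bit to 4-bit makes every weight noticeably coarser. The sketch only measures that per-weight error; it says nothing about how much error a given model can tolerate.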

Furthermore, model size is still the most significant contributor to output quality. E.g. vanilla llama-30b at 4-bit has better perplexity than any llama-13b finetune at 8-bit. Thus, if 4-bit lets you fit a larger model into available (V)RAM, you're still better off.
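The memory side of that trade-off is simple arithmetic. A back-of-the-envelope sketch, assuming nominal parameter counts (the real ones differ somewhat from the names), counting weights only (no quantization metadata, KV cache or activations), and using a hypothetical 24 GB budget:

```python
def weight_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8

budget_gb = 24  # hypothetical (V)RAM budget, not from the comment above

candidates = [("13B", 13, 8), ("30B", 30, 8), ("30B", 30, 4)]
for name, params, bits in candidates:
    size = weight_gb(params, bits)
    verdict = "fits" if size <= budget_gb else "does not fit"
    print(f"{name} at {bits}-bit: ~{size:.1f} GB of weights, {verdict}")
```

Under these assumptions the 30B model at 4-bit needs only slightly more room than the 13B model at 8-bit, while the 30B model at 8-bit does not fit in that budget at all.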

This is also why analog computing is seriously considered as a hardware architecture for LLMs: if you don't actually need bit-perfect matmul for things to work well, it can be done much more simply as an analog circuit, and then you can cram a lot more of those circuits onto the same chip. Any resulting quality loss would presumably be minor, and in any case would be more than compensated for by the much larger model sizes that such an architecture allows.
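As a toy illustration of a matmul that is not bit-perfect, the sketch below perturbs every multiply in a dot product by a small random relative error. The Gaussian noise model and the `sigma` values are assumptions picked for the example and are not a model of any real analog hardware.

```python
import random

def exact_dot(w, x):
    """Ordinary dot product."""
    return sum(a * b for a, b in zip(w, x))

def noisy_dot(w, x, sigma, rng):
    """Dot product where every multiply is off by a random relative error.

    Each product is scaled by (1 + e), with e drawn from a Gaussian with
    standard deviation sigma. This is an assumed noise model.
    """
    return sum(a * b * (1.0 + rng.gauss(0.0, sigma)) for a, b in zip(w, x))

rng = random.Random(0)                           # fixed seed, toy data
w = [rng.gauss(0.0, 1.0) for _ in range(4096)]
x = [rng.gauss(0.0, 1.0) for _ in range(4096)]

reference = exact_dot(w, x)
for sigma in (0.0, 0.01, 0.05):
    approx = noisy_dot(w, x, sigma, rng)
    print(f"sigma={sigma}: exact={reference:.3f}  noisy={approx:.3f}")
```

Because the per-multiply errors are independent and have random signs, they partly cancel when summed. Whether the remaining error is small enough for a given model is an empirical question that this sketch does not answer.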