Hacker News
by fxtentacle 943 days ago
You'd be surprised how far that takes you. I was truly astonished when I saw that a GPT-NeoX LLM quantized down to 1.5 bits per value at 90% sparsity was still producing acceptable predictions, while the size went from multiple GBs to less than 1 MB of (compressed) parameters.

Any link to this? I actually haven't seen any reported results on less than 2 bits.
Nothing public, sorry. I do consulting on how to convert AIs from CUDA to C++ to save money. With a good quantization, you can sometimes replace a $19k A100 with a $0.5k EPYC. And especially for apps and/or WebGL inference, you want small models.

Anyway, if you quantize to -1, 0, or +1 and then use arithmetic coding, you come out at around 1.58 bits per parameter. And then by skewing the distribution with forced sparsity, you have something like 5% x -1, 90% x 0, 5% x +1 which comes out at about 0.6 bits per parameter after arithmetic coding.
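
The bits-per-parameter figures above are the Shannon entropy of the symbol distribution, which is the rate an ideal arithmetic coder approaches. Here is a minimal sketch of that calculation, together with one possible way to force the sparsity (keep only the largest-magnitude weights). The function names, the Gaussian test weights, and the magnitude-based pruning rule are assumptions for illustration, not the method used in the comment.

```python
import math
import random
from collections import Counter


def entropy_bits(probs):
    """Shannon entropy in bits per symbol for a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def ternarize(weights, sparsity=0.9):
    """Map weights to -1/0/+1, zeroing all but the largest-magnitude ones.

    Assumption: sparsity is forced by magnitude pruning; the original
    comment does not say how the zeros were chosen.
    """
    n_keep = round(len(weights) * (1.0 - sparsity))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    return [(1 if w > 0 else -1) if i in keep else 0 for i, w in enumerate(weights)]


def empirical_entropy_bits(symbols):
    """Entropy of the observed symbol frequencies."""
    counts = Counter(symbols)
    return entropy_bits([c / len(symbols) for c in counts.values()])


uniform_ternary = entropy_bits([1 / 3, 1 / 3, 1 / 3])  # log2(3), about 1.58 bits
sparse_ternary = entropy_bits([0.05, 0.90, 0.05])      # a bit under 0.6 bits

# Hypothetical stand-in for a weight matrix: 1000 Gaussian samples.
random.seed(0)
fake_weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
quantized = ternarize(fake_weights, sparsity=0.9)
measured = empirical_entropy_bits(quantized)

print(f"uniform ternary: {uniform_ternary:.3f} bits/param")
print(f"5/90/5 ternary:  {sparse_ternary:.3f} bits/param")
print(f"measured on pruned sample: {measured:.3f} bits/param")
```

The entropy is only a lower bound on the coded size. A real arithmetic coder adds a small overhead, and the model still needs per-layer scale factors that are not counted here.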

I used that on "gpt_neox.layers.*.mlp.dense_h_to_4h.weight" (HuggingFace PyTorch implementation), for example. But for other layers you need more bits. For example, I could never get gpt_neox.embed_in.weight to less than 2% -2, 8% -1, 80% 0, 8% +1, 2% +2 which comes out at around 1.1 bits per parameter [1]. And then stuff like gpt_neox.layers.0.attention.query_key_value.weight will drive up your overall bits per parameter because those are very difficult to quantize or sparsify. That 1.5 was the average over the entire model and some layers compress even better while others compress worse.

[1] example calculation: https://www.wolframalpha.com/input?i=-%28log2%280.02%29*0.02...
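
The footnote's calculation and the "average over the entire model" can be sketched the same way: take the entropy of the five-level distribution, then weight each layer's bits per parameter by its parameter count. The layer sizes and the 4.0 bits figure in the example call are hypothetical placeholders, not measurements from GPT-NeoX.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits per symbol; probs must sum to 1."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)


def average_bits(layers):
    """Parameter-weighted mean of bits per parameter.

    layers: iterable of (n_params, bits_per_param) pairs.
    """
    layers = list(layers)
    total_params = sum(n for n, _ in layers)
    return sum(n * bits for n, bits in layers) / total_params


# Distribution quoted for gpt_neox.embed_in.weight: 2% -2, 8% -1, 80% 0, 8% +1, 2% +2
five_level = entropy_bits([0.02, 0.08, 0.80, 0.08, 0.02])
print(f"five-level embedding: {five_level:.3f} bits/param")

# Hypothetical model: sizes and the 4.0 bits value are made up for illustration.
example_layers = [
    (4_000_000, entropy_bits([0.05, 0.90, 0.05])),  # easy-to-compress MLP weights
    (1_000_000, five_level),                        # embedding-like layer
    (1_000_000, 4.0),                               # hard-to-quantize attention weights
]
print(f"model average: {average_bits(example_layers):.3f} bits/param")
```

Because the average is weighted by parameter count, a few stubborn layers only dominate the total when they are also large.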

Is it possible to get good performance in computation when encoding the data this way, or are a lot of cycles lost to packing and unpacking these bits?
It's actually much faster if you're limited by RAM bandwidth because instead of doing float x float mul, which requires 8 bytes of load and 4 bytes of store, you do an int8 x int8 mul with 2 bytes in and 1 byte out. And typically for a quantized LLM like this, you'd only do packing and unpacking before or after a matmul on the low-dimensional vectors so that you can directly use the quantized weights.

E.g. you quantize a 512-float activation to 512-int8, then matmul with 512x4096, GELU, 4096x512 all in int8, then de-quantize to 512-float. That means no quantization overhead on those 4,194,304 parameters (2 x 512 x 4096) in your Dense layers.
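
A rough sketch of that pipeline, scaled down to toy sizes and written in plain Python so it runs anywhere. Python integers stand in for int8 values and int32 accumulators. The symmetric per-tensor scales, the lookup-table GELU, the calibration on a single input, and all names are assumptions, since the comment does not say how these steps were implemented.

```python
import math
import random


def gelu(v):
    """Exact GELU via the error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def scale_for(values):
    """Symmetric per-tensor scale: the largest magnitude maps to 127."""
    return max(abs(v) for v in values) / 127.0 or 1.0


def quantize(values, scale):
    """Round to the nearest step and clamp to the int8 range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def matvec_int(q_x, q_w):
    """Integer vector x matrix product; q_w is indexed [input][output]."""
    return [sum(q_x[i] * q_w[i][j] for i in range(len(q_x)))
            for j in range(len(q_w[0]))]


def int8_mlp(x, q_w1, s_w1, q_w2, s_w2, s_a, gelu_lut):
    s_x = scale_for(x)
    q_x = quantize(x, s_x)                               # float -> int8, on the small vector only
    acc1 = matvec_int(q_x, q_w1)                         # int8 x int8, wide accumulator
    q_a = quantize([a * s_x * s_w1 for a in acc1], s_a)  # requantize to int8
    q_h = [gelu_lut[q + 127] for q in q_a]               # GELU as an int8 -> int8 table lookup
    acc2 = matvec_int(q_h, q_w2)                         # second matmul, still integer
    return [a * s_a * s_w2 for a in acc2]                # int -> float, again on the small vector


random.seed(0)
D_IN, D_HID = 4, 8  # toy sizes; the comment uses 512 and 4096
w1 = [[random.gauss(0.0, 0.5) for _ in range(D_HID)] for _ in range(D_IN)]
w2 = [[random.gauss(0.0, 0.5) for _ in range(D_IN)] for _ in range(D_HID)]
x = [random.uniform(-1.0, 1.0) for _ in range(D_IN)]

# Done once, offline: the weights are stored in quantized form only.
s_w1 = scale_for([v for row in w1 for v in row])
s_w2 = scale_for([v for row in w2 for v in row])
q_w1 = [quantize(row, s_w1) for row in w1]
q_w2 = [quantize(row, s_w2) for row in w2]

# Float reference path, also used here to calibrate the activation scale.
pre = [sum(x[i] * w1[i][j] for i in range(D_IN)) for j in range(D_HID)]
s_a = scale_for(pre)
gelu_lut = [quantize([gelu(q * s_a)], s_a)[0] for q in range(-127, 128)]
hidden = [gelu(p) for p in pre]
reference = [sum(hidden[j] * w2[j][k] for j in range(D_HID)) for k in range(D_IN)]

result = int8_mlp(x, q_w1, s_w1, q_w2, s_w2, s_a, gelu_lut)
print("float reference:", [round(v, 3) for v in reference])
print("int8 pipeline:  ", [round(v, 3) for v in result])
```

The only float work is on the activation vectors at the boundaries and at the requantization step. The weight matrices are touched purely as integers, which is where the bandwidth saving comes from.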

Do you know if the LLM was fine-tuned in any way to the sparsity & quantisation? Or did it just work out of the box?
I personally fine-tuned it with QAT (quantisation-aware training) and custom extensions to induce the sparsity.

https://pytorch.org/docs/stable/quantization.html#quantizati...
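
For readers unfamiliar with QAT, the core idea can be sketched in a few lines: the forward pass sees only quantized weights, while gradients update a latent full-precision copy as if the quantizer were the identity (the straight-through estimator). This is a toy in plain Python with a hand-picked threshold and a made-up four-weight regression problem. It is not the PyTorch API and not the author's setup, and the custom sparsity extensions are not described in the thread, so they are not shown.

```python
def ternarize(w, threshold=0.5):
    """Forward-pass quantizer: latent float -> -1, 0 or +1 (threshold is an assumption)."""
    return [0 if abs(v) <= threshold else (1 if v > 0 else -1) for v in w]


def predict(q, x):
    """Linear model output using the quantized weights."""
    return sum(qi * xi for qi, xi in zip(q, x))


def train(xs, ys, n_weights, lr=0.05, steps=20):
    w = [0.0] * n_weights                      # latent full-precision weights
    for _ in range(steps):
        q = ternarize(w)                       # the loss only ever sees quantized weights
        grad = [0.0] * n_weights
        for x, y in zip(xs, ys):
            err = predict(q, x) - y            # squared-error loss, gradient w.r.t. q
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(xs)
        # Straight-through estimator: apply the gradient w.r.t. q directly to w.
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w, ternarize(w)


# Made-up task: recover a ternary weight vector from four orthogonal inputs.
target = [1, 0, -1, 0]
xs = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
ys = [predict(target, x) for x in xs]

latent, learned = train(xs, ys, n_weights=4)
print("latent weights: ", latent)
print("ternary weights:", learned)
```

The latent weights drift until the quantized weights reproduce the targets, at which point the error and hence the gradient vanish. In a real model the same trick is applied per layer during fine-tuning.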