by londons_explore 1087 days ago
It's a shame that large language models are mostly moving to 4 bit weights for inference, and a bunch of papers have shown promising techniques for training in 4 bit too...

Remember that switching from 16 bit to 4 bit lets you have 4x as many weights, 4x as many weights loaded from RAM per second, and ~1/16 of the silicon area for the calculations (a multiplier scales with approximately the number of bits squared). That smaller silicon area will let you do more per $ too...
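
The arithmetic behind those ratios can be sketched in a few lines. This is only a back-of-the-envelope illustration: the function name `scaling_gains` is made up here, and the quadratic area rule is the approximation stated above, not a measured figure.

```python
def scaling_gains(old_bits, new_bits):
    """Rough gains from narrowing weights, using the rules of thumb above."""
    ratio = old_bits / new_bits
    return {
        # Same memory holds `ratio` times as many weights.
        "weights_per_byte": ratio,
        # Same memory bandwidth moves `ratio` times as many weights per second.
        "weights_per_second": ratio,
        # Assumption: multiplier area scales with bits squared.
        "multiplier_area": (new_bits / old_bits) ** 2,
    }

gains = scaling_gains(16, 4)
print(gains)
```

For 16 bit to 4 bit this gives 4x on both weight counts and 1/16 for the multiplier area.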

4 comments

There is some overhead from the quantization, and right now the operations themselves are sometimes done at higher precision than the weights in RAM.

And widespread 4 bit hardware will take some time. If the HW makers started designing 4 bit silicon in 2022, then we are still years away.

What?! Can you also train with quantization? Incredible! I'd have thought the gradients were way too ugly for any convergence with 4 bits.

Any particularly good papers you can recommend on the topic?

Here's a recent paper on training transformers with 4 bit integer weights.

https://arxiv.org/abs/2306.11987

A group at IBM has been working on minifloat training for a while. Here's a paper from 2020 on FP4 training: https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8...
Their best performing 4-bit number format uses 1 sign bit, 3 exponent bits, and no mantissa bits!

I.e., all weights, activations and gradients become powers of two! Which means all multiplications become simple bit shifts. That really changes the mathematics and the silicon design.
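
To make the bit-shift point concrete, here is a minimal sketch of a dot product over 4-bit codes with 1 sign bit and 3 exponent bits. The bit layout, the bias of 4, and the function names are assumptions made for illustration; they are not taken from the IBM paper. Note that this toy format has no code for zero.

```python
def decode(code, bias=4):
    """Value of a 4-bit code: sign bit, then 3 exponent bits (assumed layout)."""
    sign = -1 if code & 0b1000 else 1
    return sign * 2.0 ** ((code & 0b0111) - bias)

def dot_shift_only(a_codes, b_codes, bias=4):
    """Dot product using only XOR, integer add, and shift per term."""
    acc = 0  # integer accumulator, in units of 2**(-2 * bias)
    for a, b in zip(a_codes, b_codes):
        negative = (a ^ b) & 0b1000          # signs multiply via XOR
        shift = (a & 0b0111) + (b & 0b0111)  # exponents add
        term = 1 << shift                    # the "multiplication" is a shift
        acc += -term if negative else term
    return acc * 2.0 ** (-2 * bias)          # undo the bias once at the end

a = [0b0101, 0b1011, 0b0000]
b = [0b0100, 0b0110, 0b1111]
reference = sum(decode(x) * decode(y) for x, y in zip(a, b))
print(dot_shift_only(a, b), reference)
```

The shift-only result matches the floating-point reference exactly, because every product of two powers of two is itself a power of two.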

Does it really make much of a difference?

You're usually feeding a ton of multiplies into an accumulator. You can handle one or two mantissa bits as the same bit shifting except that it outputs two or three numbers to accumulate. And accumulators are very easy to scale.
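
A rough sketch of that idea: with mantissa bits, each multiply becomes a handful of shifted copies of the input, all fed to the accumulator. The function name `shifted_terms` and the fixed-point scaling by 2**n are illustrative choices, not something from the thread.

```python
def shifted_terms(x, exponent, mantissa_bits):
    """Partial products of x * (1.m1 m2 ...) * 2**exponent, each a pure shift.

    Results are scaled by 2**n (n = number of mantissa bits) so that the
    fractional parts stay integers.
    """
    n = len(mantissa_bits)
    terms = [x << (exponent + n)]  # the implicit leading 1
    for i, bit in enumerate(mantissa_bits, start=1):
        if bit:
            terms.append(x << (exponent + n - i))  # contributes x * 2**-i
    return terms

one_bit = shifted_terms(3, 2, [1])     # 3 * 1.5  * 4, scaled by 2
two_bit = shifted_terms(3, 2, [1, 1])  # 3 * 1.75 * 4, scaled by 4
print(one_bit, sum(one_bit))
print(two_bit, sum(two_bit))
```

One mantissa bit yields at most two numbers to accumulate and two mantissa bits at most three, which is the "two or three numbers" mentioned above.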

Also in the extreme I've seen powers of 4 get used.

At just 4 bits, there are only 16 possible numbers. It becomes lookup table territory - and there is no need to have the numbers on your number line be linearly or exponentially spaced - you can assign them arbitrarily. For example, you could have a number system consisting of: (±) 0.5, 1, 2, 3, 5, 10, 1000, 1000000 - getting some nice accuracy in the middle of the number line where you expect most values to lie, plus some extreme values so convergence doesn't take forever if some big activation/gradient needs to be propagated.
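
A minimal sketch of such a lookup-table number system, using the example values above. The code assignment (0 to 7 positive, 8 to 15 negative) and the nearest-value rounding rule are assumptions made here for illustration.

```python
MAGNITUDES = [0.5, 1, 2, 3, 5, 10, 1000, 1000000]
# Assumed code layout: 0..7 are the positive values, 8..15 their negatives.
TABLE = MAGNITUDES + [-m for m in MAGNITUDES]

def quantize(x):
    """Return the 4-bit code whose table value is nearest to x."""
    return min(range(16), key=lambda code: abs(TABLE[code] - x))

# Multiplication is a 16 x 16 lookup, so spacing on the number line is free.
PRODUCT = [[TABLE[a] * TABLE[b] for b in range(16)] for a in range(16)]

print(TABLE[quantize(2.4)])    # snaps to a nearby mid-range value
print(TABLE[quantize(-700)])   # snaps to one of the extreme values
print(PRODUCT[quantize(3)][quantize(-5)])
```

In hardware the product table would itself hold quantized results; here it stores exact products to keep the sketch short.
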
The more recent 4 bit quantizations are almost along these lines. Q4_1 in ggml, for example, takes a block of 32 weights, gives each block a scaling factor 'd', and takes the minimum of the weights 'm' to be the quantized '0', so the final weight from a quantized weight 'q' is q * d + m. Taking a relatively small block size makes it more likely that the weights all fall within a reasonable quantization range. Notably, d and m can be stored with more accuracy without sacrificing too much space, since the overhead is divided by 32. Q4_K goes a bit further: it takes 'superblocks' of 8 blocks and applies another scaling factor 'd_s' and minimum 'm_s' to that, so the final weight is (q * d + m) * d_s + m_s, and the additional factors are stored as 6 bits instead of 4.

In practice this seems to get very good results, while being cheap to implement and relatively space efficient. Q4_K, for example, works out to 4.5 bits per weight instead of 4. The PR adding it has more details: https://github.com/ggerganov/llama.cpp/pull/1684
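
The Q4_1-style scheme described above (q * d + m per block of 32) can be sketched roughly as follows. This is not ggml's actual code: the function names, the rounding rule, and the assumption that d and m take 16 bits each are all illustrative.

```python
BLOCK_SIZE = 32  # weights per block, as described above
LEVELS = 15      # a 4-bit code q ranges over 0..15

def quantize_block(weights):
    """Return (codes, d, m) so that each weight is roughly q * d + m."""
    m = min(weights)                      # minimum maps to q = 0
    d = (max(weights) - m) / LEVELS       # maximum maps to q = 15
    if d == 0:
        d = 1.0                           # constant block: any scale works
    q = [min(LEVELS, max(0, round((w - m) / d))) for w in weights]
    return q, d, m

def dequantize_block(q, d, m):
    return [qi * d + m for qi in q]

def bits_per_weight(code_bits=4, overhead_bits=32, block_size=BLOCK_SIZE):
    # Assumption: d and m are stored as 16 bits each, shared by the block.
    return code_bits + overhead_bits / block_size

weights = [i / 31 - 0.5 for i in range(BLOCK_SIZE)]
q, d, m = quantize_block(weights)
restored = dequantize_block(q, d, m)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(worst_error, d / 2, bits_per_weight())
```

The worst-case reconstruction error is about half a step (d / 2), and the per-block overhead is amortized over all 32 weights.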

Very efficient for storage and memory bandwidth, but such a scheme is a headache for high throughput hardware implementations (at least compared to regular 4 bit math, which can be packed really really densely).
Also I would highly recommend Q5_K_M for both 7B and 13B models.

It has the best balance between quality and model size, and is almost indistinguishable from the original f16: https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...

This is an excellent explanation, thank you!!
I dimly remember reading that the mathematical compute-per-density optimum is around 3.x bits in a "brain-like structure". I don't remember any details or the precise context, though. Does this ring a bell with anyone?
Is it possible we will eventually see 1-bit weights in use?
There are already papers on it, and there is 2-bit quant in llama.cpp.

But it seems to be past the point of diminishing returns, where you might as well use a model with fewer parameters... For now.

There was another scheme in a paper where the "sparse" majority of the model was highly quantized, while the "dense" part was left in FP16, with good results.

For some time I played with Brevitas and Xilinx's FINN, where you could quantize like crazy. I haven't looked at where they are since transformers took over the AI world.