by sbierwagen 1086 days ago
A group at IBM has been working on minifloat training for a while. Here's a paper from 2020 on FP4 training: https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8...

Their best performing 4-bit number format uses 1 sign bit, 3 exponent bits, and no mantissa bits!

I.e., all weights, activations and gradients become powers of two! That means all multiplications become simple bit shifts, which really changes the mathematics and the silicon design.
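To make the point above concrete, here is a minimal sketch of multiplying two values in a 1-sign-bit, 3-exponent-bit format without a multiplier. The bit layout, the bias of 4, and the names `decode`, `multiply` and `product_value` are assumptions for illustration, not the paper's actual encoding. The sign of the product is an XOR and the exponent is an integer add.

```python
# Hypothetical 4-bit code: bit 3 is the sign, bits 0-2 are the exponent field.
# The bias of 4 is an assumption for illustration, not taken from the paper.
BIAS = 4


def decode(code: int) -> float:
    """Return the real value of a 4-bit code: (+/-) 2**(exponent_field - BIAS)."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code & 0b0111) - BIAS
    return sign * 2.0 ** exponent


def multiply(a: int, b: int) -> tuple:
    """Multiply two codes using only XOR and integer addition.

    Returns (sign_bit, unbiased_exponent) of the product. The result can need
    a wider exponent than 3 bits, so it is not re-encoded into 4 bits here.
    """
    sign_bit = ((a ^ b) >> 3) & 1
    exponent = (a & 0b0111) + (b & 0b0111) - 2 * BIAS
    return sign_bit, exponent


def product_value(a: int, b: int) -> float:
    """Real value of the product computed by multiply()."""
    sign_bit, exponent = multiply(a, b)
    return (-1.0 if sign_bit else 1.0) * 2.0 ** exponent
```

Multiplying an integer by one of these values is then a shift by the exponent, which is where the "bit shift" framing comes from.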

Does it really make much of a difference?

You're usually feeding a ton of multiplies into an accumulator. With one or two mantissa bits you can use the same bit shifting, except that each multiply outputs two or three numbers to accumulate. And accumulators are very easy to scale.
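A rough sketch of that idea, assuming integer activations and a weight stored as a non-negative exponent plus a few mantissa bits (the function names and the integer scaling are hypothetical): each set mantissa bit contributes one extra shifted copy of the input, and everything goes into the same accumulator.

```python
def shifted_terms(x: int, exponent: int, mantissa: int, mantissa_bits: int) -> list:
    """Split x * w into shift-only terms.

    The weight is w = (2**mantissa_bits + mantissa) * 2**exponent, i.e. the
    usual 1.m * 2**e form scaled up so that everything stays an integer.
    Assumes exponent >= 0.
    """
    terms = [x << (exponent + mantissa_bits)]  # implicit leading 1
    for bit in range(mantissa_bits):
        if (mantissa >> bit) & 1:
            terms.append(x << (exponent + bit))  # one extra term per set bit
    return terms


def dot_shift_only(xs: list, weights: list, mantissa_bits: int) -> int:
    """Dot product using only shifts and one accumulator.

    weights is a list of (exponent, mantissa) pairs.
    """
    acc = 0
    for x, (exponent, mantissa) in zip(xs, weights):
        for term in shifted_terms(x, exponent, mantissa, mantissa_bits):
            acc += term
    return acc


# With 1 mantissa bit the weights below are 3, 8 and 6.
xs = [3, -5, 7]
weights = [(0, 1), (2, 0), (1, 1)]
result = dot_shift_only(xs, weights, mantissa_bits=1)
```

With one mantissa bit a multiply yields at most two terms, and with two bits at most three, matching the count in the comment.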

Also in the extreme I've seen powers of 4 get used.

At just 4 bits, there are only 16 possible numbers. It becomes lookup table territory, and there is no need to have the numbers on your number line be linearly or exponentially spaced: you can assign them arbitrarily. For example, you could have a number system consisting of: ± 0.5, 1, 2, 3, 5, 10, 1000, 1000000. That gets you some nice accuracy in the middle of the number line where you expect most values to lie, plus some extreme values so convergence doesn't take forever if some big activation/gradient needs to be propagated.
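A small sketch of what that lookup-table approach could look like, using the example values from the comment above. The code assignment (low 8 codes positive, high 8 negative), nearest-value rounding, and the names `TABLE`, `quantize`, `dequantize` and `PRODUCTS` are assumptions for illustration.

```python
# The eight magnitudes from the example; each also gets a negative twin,
# for 16 codes in total. The code-to-value assignment is an assumption.
MAGNITUDES = [0.5, 1, 2, 3, 5, 10, 1000, 1000000]
TABLE = [float(m) for m in MAGNITUDES] + [-float(m) for m in MAGNITUDES]


def quantize(x: float) -> int:
    """Return the 4-bit code whose table value is nearest to x."""
    return min(range(16), key=lambda code: abs(TABLE[code] - x))


def dequantize(code: int) -> float:
    """Look up the real value for a 4-bit code."""
    return TABLE[code]


# With only 16 codes, every possible product fits in a 16 x 16 table,
# so a "multiply" is a single lookup indexed by the two codes.
PRODUCTS = [[TABLE[a] * TABLE[b] for b in range(16)] for a in range(16)]
```

Note that this particular table has no exact zero, so small values round to ±0.5.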
The more recent 4-bit quantizations are almost along these lines. Q4_1 in ggml, for example, takes blocks of 32 weights, gives each block a scaling factor 'd', and takes the minimum of the weights 'm' to be the quantized '0', so the final weight from a quantized weight 'q' is q * d + m. Taking a relatively small block size makes it more likely that those are all within a reasonable quantization range. Notably, d and m can be stored with more accuracy without sacrificing too much space, since the overhead is divided by 32. Q4_K goes a bit further: it takes 'superblocks' of 8 blocks and applies another scaling factor 'd_s' and minimum 'm_s' to that, so the final weight is (q * d + m) * d_s + m_s, and the additional factors are stored as 6 bits instead of 4.

In practice this seems to get very good results while being cheap to implement and relatively space efficient: Q4_K, for example, works out to 4.5 bits per weight instead of 4. The PR adding it has more details: https://github.com/ggerganov/llama.cpp/pull/1684
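For anyone who wants to play with the Q4_1 idea described above, here is a minimal pure-Python sketch of per-block quantization with w ≈ q * d + m. It is not ggml's actual code: the function names are hypothetical, d and m are kept as plain Python floats, and the 16-bit storage for d and m in the size estimate is an assumption.

```python
BLOCK_SIZE = 32  # weights per block, as in the description above
LEVELS = 15      # 4-bit codes run from 0 to 15


def quantize_block(weights: list) -> tuple:
    """Quantize one block to 4-bit codes plus a scale d and a minimum m."""
    m = min(weights)
    d = (max(weights) - m) / LEVELS
    if d == 0.0:
        # All weights are equal: every code is 0 and q * d + m gives m back.
        return [0] * len(weights), 0.0, m
    codes = [min(LEVELS, max(0, round((w - m) / d))) for w in weights]
    return codes, d, m


def dequantize_block(codes: list, d: float, m: float) -> list:
    """Reconstruct approximate weights as q * d + m."""
    return [q * d + m for q in codes]


def bits_per_weight(block_size: int, code_bits: int, overhead_bits: int) -> float:
    """Average storage per weight once the per-block overhead is spread out."""
    return (block_size * code_bits + overhead_bits) / block_size


# Example block: 32 evenly spaced weights from -1.5 to 1.6.
block = [i / 10 - 1.5 for i in range(BLOCK_SIZE)]
codes, d, m = quantize_block(block)
restored = dequantize_block(codes, d, m)

# Assuming d and m are each stored in 16 bits, the overhead is 32 bits per block.
q4_1_bits = bits_per_weight(BLOCK_SIZE, 4, 32)
```

The reconstruction error per weight is at most about d / 2, which is why a small block (and therefore a small max - min range) helps.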

Very efficient for storage and memory bandwidth, but such a scheme is a headache for high-throughput hardware implementations (at least compared to regular 4-bit math, which can be packed really, really densely).
Also I would highly recommend Q5_K_M for both 7B and 13B models.

It has the best balance between quality and model size, and is almost indistinguishable from the original f16: https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...

This is an excellent explanation, thank you!!