by londons_explore 1088 days ago
At just 4 bits, there are only 16 possible numbers. It becomes lookup table territory - and there is no need to have the numbers on your number line be linearly or exponentially spaced - you can assign them arbitrarily. For example, you could have a number system consisting of: (+-) 0.5, 1, 2, 3, 5, 10, 1000, 1000000 - getting some nice accuracy in the middle of the number line where you expect most values to lie, plus some extreme values so convergence doesn't take forever if some big activation/gradient needs to be propagated.
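To make the lookup-table idea concrete, here is a minimal sketch in Python. It assumes the eight magnitudes listed above are mirrored in sign to give 16 entries and that values are rounded to the nearest entry; the names CODEBOOK, encode and decode are made up for illustration and are not from any library.

```python
import bisect

# Hypothetical 16-entry codebook: the eight magnitudes from the comment,
# mirrored in sign (an assumption about what "(+-)" means), kept sorted.
MAGNITUDES = [0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 1000.0, 1000000.0]
CODEBOOK = sorted([-m for m in MAGNITUDES] + MAGNITUDES)

def encode(x):
    """Return the 4-bit code (0..15) of the codebook entry nearest to x."""
    i = bisect.bisect_left(CODEBOOK, x)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(CODEBOOK)]
    return min(candidates, key=lambda j: abs(CODEBOOK[j] - x))

def decode(code):
    """Map a 4-bit code back to its value with a plain table lookup."""
    return CODEBOOK[code]
```

Decoding is a single table read, which is the appeal; encoding needs a nearest-entry search, done here with a binary search over the sorted table.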

The more recent 4-bit quantizations are almost along these lines. Q4_1 in ggml, for example, takes a block of 32 weights, gives each block a scaling factor 'd', and takes the minimum of the weights 'm' to be the quantized '0', so the final weight from a quantized value 'q' is q * d + m. Taking a relatively small block size makes it more likely that the weights all fall within a reasonable quantization range. Notably, d and m can be stored with more precision without sacrificing too much space, since the overhead is divided across 32 weights. Q4_K goes a bit further: it takes 'superblocks' of 8 blocks and applies another scaling factor 'd_s' and minimum 'm_s' to that, so the final weight is (q * d + m) * d_s + m_s, and the additional factors are stored as 6 bits instead of 4.
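A rough sketch of the Q4_1-style scheme described above, in plain Python. This is not ggml's code: the function names are hypothetical, the codes are kept as a list of ints rather than packed two per byte, and d and m stay as ordinary Python floats instead of a reduced-precision storage format.

```python
def quantize_q4_1(block):
    """Quantize one block of floats to 4-bit codes plus a scale d and minimum m."""
    m = min(block)
    d = (max(block) - m) / 15  # 16 levels span 15 steps between min and max
    if d == 0:
        # Constant block: every weight equals m, so every code is 0.
        return [0] * len(block), d, m
    # Round to the nearest level and clamp into the 4-bit range 0..15.
    q = [min(15, max(0, round((w - m) / d))) for w in block]
    return q, d, m

def dequantize_q4_1(q, d, m):
    """Reconstruct the weights as q * d + m."""
    return [qi * d + m for qi in q]
```

With this layout the reconstruction error of any weight is at most d / 2, so a block whose weights sit close together gets a small d and a small error, which is the argument for small blocks.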

In practice this seems to get very good results while being cheap to implement and relatively space efficient: Q4_K, for example, works out to 4.5 bits per weight instead of 4. The PR adding it has more details: https://github.com/ggerganov/llama.cpp/pull/1684
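The 4.5 figure can be sanity-checked with a bit of arithmetic. The layout below is an assumption rather than something stated in this thread: 8 blocks of 32 four-bit codes, one 6-bit scale and one 6-bit minimum per block, plus two 16-bit factors per superblock. The helper name bits_per_weight is made up for the example.

```python
def bits_per_weight(weights_per_block, blocks, code_bits,
                    per_block_bits, per_superblock_bits):
    """Average storage cost per weight, including the shared metadata."""
    n = weights_per_block * blocks
    total = n * code_bits + blocks * per_block_bits + per_superblock_bits
    return total / n

# Assumed Q4_K-style layout: 256 weights, 8 * (6 + 6) bits of per-block
# metadata, and 2 * 16 bits of per-superblock metadata.
q4_k = bits_per_weight(32, 8, 4, 6 + 6, 2 * 16)

# Assumed Q4_1-style layout for comparison: one block of 32 weights with
# d and m each stored in 16 bits.
q4_1 = bits_per_weight(32, 1, 4, 2 * 16, 0)
```

Under those assumptions q4_k comes out to 1152 / 256 = 4.5 and q4_1 to 160 / 32 = 5.0, which shows how sharing the wider factors across a superblock keeps the overhead to half a bit per weight.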

Very efficient for storage and memory bandwidth, but such a scheme is a headache for high-throughput hardware implementations (at least compared to regular 4-bit math, which can be packed really, really densely).

Also, I would highly recommend Q5_K_M for both 7B and 13B models.

It has the best balance between quality and model size, and is almost indistinguishable from the original f16: https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...

This is an excellent explanation, thank you!!