| HN Mirror



	by s_m_t 827 days ago
	How can 4 bits possibly be enough? Are intermediate calculations done at a higher width and then converted down back to FP4?

5 comments

WhitneyLand 827 days ago

- Training isn’t done at 4 bits; to date, this small size has only been used for inference.

- Research for a while now has been finding that smaller weights are surprisingly effective. It’s kind of a counterintuitive result, but one way to think about it is there are billions of weights working together. So taken as a whole you still have a large amount of information.

acchow 827 days ago

Intuitively, there is a ton of redundancy, and there is still a long way to go in compressing things.

imtringued 826 days ago

Each token is represented by a vector of 4096 floats. Of course there is redundancy.

tmalsburg2 826 days ago

> - Training isn’t done at 4 bits; to date, this small size has only been used for inference.

Wasn't there a paper from Microsoft two weeks ago or so where they trained on log₂(3) bits?

Edit: https://arxiv.org/pdf/2402.17764.pdf

terramex 826 days ago

They don't "train on log₂(3) bits". As far as I can tell from the paper, activations are quantised to 8 bits, gradients and a latent copy of the weights are kept at higher precision, and the weights are quantised to ternary values after every update.

This makes the network minimise not only the loss with regard to the expected outcome but also the loss resulting from quantisation. In big networks, the "knowledge" is encoded in relationships between weights, not in their absolute values, so lower precision works well as long as the network is big enough.
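A minimal sketch of the "quantise after every update" idea, for a single linear neuron. It is not the paper's training code: the absmean scaling follows the scheme described in the linked paper as I read it, while the straight-through gradient, the squared-error loss, and the names `ternary_quantize` and `train_step` are assumptions made for illustration.

```python
def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantisation: scale by mean |w|, round, clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def train_step(latent, inputs, target, lr=0.01):
    """One SGD step for a single linear neuron with a squared-error loss."""
    quantized, scale = ternary_quantize(latent)
    # The forward pass only ever sees the quantised weights.
    pred = scale * sum(q * x for q, x in zip(quantized, inputs))
    error = pred - target
    # Assumed straight-through estimator: quantisation is treated as the
    # identity in the backward pass, so d(loss)/d(latent_i) = error * x_i.
    new_latent = [w - lr * error * x for w, x in zip(latent, inputs)]
    return new_latent, 0.5 * error * error


latent = [0.5, -0.3, 0.8]  # higher-precision "latent" weights
for _ in range(5):
    latent, loss = train_step(latent, inputs=[1.0, 2.0, -1.0], target=1.0)
```

The latent weights keep accumulating small updates at higher precision, while the loss is always measured through the ternary weights, which is why the network also ends up adapting to the quantisation error.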

coffeebeqn 827 days ago

Maybe the rounding errors are noise that is somewhat useful in a big enough neural net. Image generators also generate noise to work on.

yalok 827 days ago

There are research papers where even 1 bit (not floating point) was enough, with some quality loss.

4 bits is effectively 16 different floating-point numbers: 8 positive, 8 negative, no zero and no NaN/inf. There is 1 bit for the sign, 3 bits for the exponent and 0 bits for the mantissa; the mantissa is implied to be 4. It’s logarithmic, representing numbers in the range from -4^3 to 4^3, with the smallest magnitudes being 4^-3.
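To make the layout above concrete, here is a hedged sketch that enumerates all 16 codes of a sign-plus-3-bit-exponent format with base 4. The exponent bias is not given in the comment: exponents -3 to 3 cover only seven of the eight exponent codes, so `bias=4` (exponents -4 to 3) is a guess that keeps the stated maximum of 4^3. The function name `decode_log4` is hypothetical.

```python
def decode_log4(code, bias=4, base=4.0):
    """Decode a 4-bit code: 1 sign bit, 3 exponent bits, no mantissa bits."""
    if not 0 <= code <= 15:
        raise ValueError("code must fit in 4 bits")
    sign = -1.0 if code & 0b1000 else 1.0
    # The bias is an assumed value; the comment above does not specify one.
    exponent = (code & 0b0111) - bias
    return sign * base ** exponent


values = sorted(decode_log4(code) for code in range(16))
for value in values:
    print(value)
```

Every code maps to a distinct nonzero power of 4, so under these assumptions there is indeed no zero, infinity or NaN in the format.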

phh 826 days ago

Thanks. First source I've seen for what FP4 is. Gotta say I'm surprised: I would have chosen to lose one value but have a zero. (Though I have no doubt those people are much more clever and knowledgeable than I am.)

omikun 826 days ago

If the weight is zero it doesn’t need to exist

carlmr 826 days ago

>1 bit (not floating point)

I like how you specified that it's not floating point.

s_m_t 827 days ago

Thanks, I was thinking that zero, negative zero, inf, negative inf, and the NaNs were included, like in IEEE 754.

anon291 827 days ago

The fundamental 'unit' of NN computation is not an individual vector element but rather an entire vector. One of the first results you often learn about in linear algebra is that some axes are more important than others (principal components, singular value decomposition). Thus, it totally stands to reason that what matters is not the underlying field of the vector but the vector machinery as a whole. All you have to do is make sure that there are enough elements in the vector to get the job done for whatever bit size of element.

s_m_t 827 days ago

I see, so the idea is that enough of the quantization errors are sort of averaged out across the dimensions of the vector space to still be useful?
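One way to poke at this intuition is a toy experiment, which says nothing definitive about real networks. The sketch below quantises one operand of a dot product to 16 evenly spaced levels (a uniform grid, not FP4) and reports the relative error for a few dimensions. It assumes the activations are a noisy copy of the weights, so that the exact dot product grows with the dimension; all names here are made up for the example.

```python
import random


def quantize_uniform(x, levels=16, lo=-1.0, hi=1.0):
    """Round x to the nearest of `levels` evenly spaced points in [lo, hi]."""
    step = (hi - lo) / (levels - 1)
    index = round((x - lo) / step)
    index = max(0, min(levels - 1, index))
    return lo + index * step


def relative_dot_error(dim, seed=0):
    """Relative error of a dot product when the weights are quantised."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    # Assumption: activations are correlated with the weights (a noisy copy),
    # so the exact dot product grows roughly linearly with the dimension.
    acts = [w + rng.gauss(0.0, 0.1) for w in weights]
    exact = sum(w * a for w, a in zip(weights, acts))
    approx = sum(quantize_uniform(w) * a for w, a in zip(weights, acts))
    return abs(approx - exact) / abs(exact)


for dim in (16, 256, 4096):
    print(dim, relative_dot_error(dim))
```

Running it with different seeds gives a feel for how much of the per-element rounding error survives in the sum as the vector gets longer.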

singularity2001 826 days ago

The way I think about it is that eventually it will end up as a binary feature vector, similar to 20 Questions (male or female, alive or dead, ...), just with hundreds of dimensions.

CamperBob2 827 days ago

The various sigmoid activation functions have the effect of keeping bit growth under control, by virtue of clamping to the +/- 1 range.

wongarsu 827 days ago

For training, FP4 sounds pretty niche, but for inference it might be very useful.