| HN Mirror


	by WhitneyLand 827 days ago
	- Training isn’t done at 4-bits, to date this small size has only been for inference.
	- Research for a while now has been finding that smaller (lower-precision) weights are surprisingly effective. It’s kind of a counterintuitive result, but one way to think about it is that there are billions of weights working together. So taken as a whole you still have a large amount of information.
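
To put a rough number on "a large amount of information", here is a small back-of-the-envelope sketch. The 7-billion weight count is an assumed example and not a figure from this thread; the function name is hypothetical too.

```python
# Rough storage for model weights at different bit widths.
# NUM_WEIGHTS is an assumed example size, not a figure from the thread.

def weight_storage_gib(num_weights: int, bits_per_weight: float) -> float:
    """Return the storage needed for the weights alone, in GiB."""
    total_bits = num_weights * bits_per_weight
    return total_bits / 8 / 2**30

NUM_WEIGHTS = 7_000_000_000  # hypothetical model size

for bits in (16, 8, 4):
    gib = weight_storage_gib(NUM_WEIGHTS, bits)
    print(f"{bits:>2}-bit weights: {gib:.2f} GiB ({NUM_WEIGHTS * bits:,} bits total)")
```

Even at 4 bits per weight, a model of that assumed size still stores tens of billions of bits in total.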

3 comments

acchow 827 days ago

Intuitively, there is a ton of redundancy, and there is still a long way to go in compressing things.

imtringued 826 days ago

Each token is represented by a vector of 4096 floats. Of course there is redundancy.

tmalsburg2 826 days ago

> - Training isn’t done at 4-bits, to date this small size has only been for inference.

Wasn't there a paper from Microsoft two weeks ago or so where they trained on log₂(3) bits?

Edit: https://arxiv.org/pdf/2402.17764.pdf
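
For reference, log₂(3) is the information content of a weight that can take one of three values. A minimal sketch of that arithmetic follows, together with one possible way to store such weights. The packing scheme (five ternary values per byte) and the names `pack_trits` / `unpack_trits` are illustrative assumptions, not something taken from the linked paper.

```python
import math

# Information content of one ternary weight (three possible values: -1, 0, +1).
bits_per_ternary_weight = math.log2(3)

def pack_trits(trits):
    """Pack five values from {-1, 0, +1} into one byte (3**5 = 243 <= 256)."""
    assert len(trits) == 5
    byte = 0
    for t in trits:
        byte = byte * 3 + (t + 1)  # shift each value to a base-3 digit 0..2
    return byte

def unpack_trits(byte):
    """Inverse of pack_trits: recover the five ternary values."""
    out = []
    for _ in range(5):
        out.append(byte % 3 - 1)
        byte //= 3
    return out[::-1]

weights = [1, -1, 0, 0, 1]
packed = pack_trits(weights)
print(f"log2(3) = {bits_per_ternary_weight:.3f} bits per weight")
print(f"packed byte = {packed}, round trip ok = {unpack_trits(packed) == weights}")
```

Under this assumed scheme the storage cost is 8 / 5 = 1.6 bits per weight, slightly above the log₂(3) lower bound.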

terramex 826 days ago

They don't "train on log₂(3) bits". Gradients are still calculated at full precision (activations at 8 bits), and the weights are re-quantised after every update.

This makes the network minimise loss not only with regard to the expected outcome but also the loss resulting from quantisation. In big networks, "knowledge" is encoded in the relationships between weights, not in their absolute values, so lower precision works well as long as the network is big enough.
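
The training loop described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a shared scale equal to the mean absolute weight, a straight-through gradient (the gradient computed for the quantised weights is applied directly to the full-precision ones), and a hypothetical four-weight linear model.

```python
import random

def quantise_ternary(weights, eps=1e-8):
    """Round weights to {-1, 0, +1} with a shared scale (mean absolute value)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def train(steps=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    target = [1.0, -1.0, 0.0, 1.0]                     # hypothetical "true" weights
    latent = [rng.uniform(-0.5, 0.5) for _ in target]  # full-precision weights
    for _ in range(steps):
        x = [rng.uniform(-1, 1) for _ in target]
        y = sum(t * xi for t, xi in zip(target, x))
        # Forward pass sees only the quantised weights.
        ternary, scale = quantise_ternary(latent)
        pred = sum(scale * q * xi for q, xi in zip(ternary, x))
        err = pred - y
        # Backward pass: the squared-error gradient w.r.t. the quantised
        # weights is applied to the latent weights (straight-through).
        latent = [w - lr * 2 * err * xi for w, xi in zip(latent, x)]
    return latent, quantise_ternary(latent)

latent, (ternary, scale) = train()
print("latent weights:", [round(w, 3) for w in latent])
print("ternary weights:", ternary, "scale:", round(scale, 3))
```

Because the loss is always measured on the quantised weights, the full-precision weights are pushed toward values that still work after rounding, which is the effect the comment describes.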

coffeebeqn 827 days ago

Maybe the rounding errors are noise that is somewhat useful in a big enough neural net. Image generators also generate noise to work on.
