by tmalsburg2 826 days ago
> - Training isn’t done at 4-bits, to date this small size has only been for inference.

Wasn't there a paper from Microsoft two weeks ago or so where they trained on log₂(3) bits?

Edit: https://arxiv.org/pdf/2402.17764.pdf

1 comments

They don't "train on log₂(3) bits". Gradients and activations are still calculated at higher precision (the activations at 8 bits), and the weights are re-quantised after every update.

This makes the network minimise not only the loss with regard to the expected outcome but also the loss resulting from quantisation. In big networks the "knowledge" is encoded in the relationships between weights, not in their absolute values, so lower precision works well as long as the network is big enough.
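To make the "quantise after every update" idea concrete, here is a rough NumPy sketch. It is not the paper's code. The absmean ternary quantiser is my reading of the linked paper; the toy regression problem, the names (`ternary_quantise`, `latent_w`) and the straight-through gradient (pretending the rounding step is the identity on the backward pass) are assumptions made for illustration.

```python
import numpy as np

def ternary_quantise(w, eps=1e-8):
    """Round weights to {-1, 0, +1} with one per-tensor scale (absmean)."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

# Toy regression data (hypothetical, only here to have a loss to minimise).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))
true_w = rng.normal(size=(16, 1))
y = X @ true_w

# The latent weights stay in full precision; only the forward pass
# sees the ternary copy.
latent_w = rng.normal(size=(16, 1)) * 0.01
lr = 0.05
losses = []

for step in range(200):
    q, scale = ternary_quantise(latent_w)   # re-quantise on every step
    w_used = q * scale                      # weights the "network" actually uses
    err = X @ w_used - y
    losses.append(float((err ** 2).mean()))

    grad = 2 * X.T @ err / len(X)           # gradient w.r.t. w_used
    # Straight-through assumption: apply that gradient to the latent weights
    # as if quantisation were the identity.
    latent_w -= lr * grad
```

Because the loss is always measured on the quantised weights, the optimiser is pushed towards latent weights that still work after rounding, which is the point made above. A 16-weight model like this one will not fit well with three levels per weight; the argument is that this matters less as the network grows.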