| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by t-vi 1213 days ago
	In my understanding, at a very high level and omitting many crucial details, the key is that when you have mainly largish matrix multiplications (as in transformers) well-behaved (mean zero uncorrelated random or so) quantization errors cancel out. People do/did experiment with 1 or 2 bit compression of gradients/updates in the context of distributed training, but there it has been generally deemed useful to keep track of compression errors locally.