by t-vi
1213 days ago
In my understanding, at a very high level and omitting many crucial details, the key is that when the computation is dominated by largish matrix multiplications (as in transformers), well-behaved quantization errors (zero-mean, uncorrelated, roughly random) tend to cancel out.
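To make the cancellation a bit more concrete, here is a toy sketch in plain Python. Everything in it is my own illustrative assumption (the `quantize` and `error_stats` helpers, the uniform rounding grid, the Gaussian inputs), not any real quantization scheme. It compares the actual error of a quantized dot product with the worst case you would get if all per-element errors lined up with the same sign.

```python
import random

def quantize(x, step):
    """Round x to the nearest multiple of step (a toy uniform quantizer)."""
    return round(x / step) * step

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def error_stats(n, step=0.25, seed=0):
    """Return (actual, worst) absolute error of a length-n quantized dot product.

    actual: |<wq, x> - <w, x>|, where the signed errors are free to cancel.
    worst:  sum_i |wq_i - w_i| * |x_i|, i.e. no cancellation at all.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for weights
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for activations
    wq = [quantize(v, step) for v in w]
    actual = abs(dot(wq, x) - dot(w, x))
    worst = sum(abs(q - v) * abs(xi) for q, v, xi in zip(wq, w, x))
    return actual, worst

if __name__ == "__main__":
    for n in (16, 256, 4096):
        actual, worst = error_stats(n)
        print(n, actual, worst, actual / worst)
```

For zero-mean, uncorrelated errors the actual error should grow roughly like sqrt(n) while the worst case grows like n, so the ratio shrinks as the inner dimension gets larger. That is the sense in which large matrix multiplications are forgiving.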
People do/did experiment with 1- or 2-bit compression of gradients/updates in the context of distributed training, but there it has generally been deemed useful to keep track of the compression errors locally.
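A minimal sketch of what "keeping track of compression errors locally" can look like, usually called error feedback. The specifics are my assumptions for illustration (sign compression with a single shared scale, synthetic Gaussian "gradients", the hypothetical `sign_compress` and `run` helpers); real systems differ in the details.

```python
import random

def sign_compress(v):
    """1-bit compression: keep only the signs, plus one shared scale (mean |v|)."""
    scale = sum(abs(x) for x in v) / len(v)
    return [scale if x >= 0 else -scale for x in v]

def run(steps=200, dim=8, seed=0, error_feedback=True):
    """Simulate one worker sending compressed gradients for `steps` rounds.

    Returns (true_total, sent_total, residual): the sum of the uncompressed
    gradients, the sum of what was actually transmitted, and the compression
    error still held locally at the end.
    """
    rng = random.Random(seed)
    residual = [0.0] * dim
    true_total = [0.0] * dim
    sent_total = [0.0] * dim
    for _ in range(steps):
        # Synthetic gradient with a different mean per coordinate.
        grad = [rng.gauss(0.1 * (i + 1), 1.0) for i in range(dim)]
        # Add back whatever earlier rounds failed to transmit.
        corrected = [g + r for g, r in zip(grad, residual)]
        sent = sign_compress(corrected)
        if error_feedback:
            # Remember locally what the compressor dropped this round.
            residual = [c - s for c, s in zip(corrected, sent)]
        true_total = [t + g for t, g in zip(true_total, grad)]
        sent_total = [t + s for t, s in zip(sent_total, sent)]
    return true_total, sent_total, residual

if __name__ == "__main__":
    for ef in (True, False):
        true_total, sent_total, residual = run(error_feedback=ef)
        gap = max(abs(t - s) for t, s in zip(true_total, sent_total))
        print("error_feedback =", ef, "max |true - sent| =", gap)
```

With error feedback the sums telescope: the total of what was sent equals the total of the true gradients minus the final residual, so the compression error is carried forward rather than lost. Without it, each round's error is simply discarded.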