by yorwba
99 days ago
Well, the weights are accumulated in full precision and are multiplied by a full-precision scale factor after quantization, and the activations and backward pass are computed in full precision as well, so it's not quite true 4-bit precision training. The resulting model can be stored with just slightly more than 4 bits per parameter, though.
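To make the "slightly more than 4 bits per parameter" arithmetic concrete, here is a rough sketch of block-wise quantization with a per-block scale factor. Everything specific in it is my own assumption for illustration, not the format under discussion: the block size of 32, the 16-bit stored scale, the symmetric integer code range [-7, 7], and the function names are all hypothetical.

```python
import random

# All constants below are illustrative assumptions, not the actual format.
BLOCK_SIZE = 32   # weights sharing one scale factor
CODE_BITS = 4     # bits stored per weight
SCALE_BITS = 16   # bits stored per block for the scale factor
QMAX = 7          # symmetric signed 4-bit integer range [-7, 7]


def quantize_block(block):
    """Map one block of full-precision weights to (scale, 4-bit codes)."""
    max_abs = max(abs(w) for w in block)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    # Round to the nearest code and clamp into the representable range.
    codes = [max(-QMAX, min(QMAX, round(w / scale))) for w in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights: full-precision scale times each code."""
    return [scale * c for c in codes]


def bits_per_parameter(block_size=BLOCK_SIZE):
    """Storage cost: the codes plus the scale amortized over the block."""
    return CODE_BITS + SCALE_BITS / block_size


random.seed(0)
# Stand-in for the full-precision "master" weights kept during training.
master_weights = [random.gauss(0.0, 0.02) for _ in range(4 * BLOCK_SIZE)]

blocks = [master_weights[i:i + BLOCK_SIZE]
          for i in range(0, len(master_weights), BLOCK_SIZE)]
quantized = [quantize_block(b) for b in blocks]
restored = [w for scale, codes in quantized
            for w in dequantize_block(scale, codes)]

max_error = max(abs(a - b) for a, b in zip(master_weights, restored))
print(bits_per_parameter())  # 4.5 under the assumptions above
```

Under these assumptions the overhead is 16/32 = 0.5 bits per weight. A smaller scale or a larger block pushes the total closer to 4, at the cost of coarser scaling within each block.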
I can easily understand how the block formats win.