by SuchAnonMuchWow
613 days ago
The goal of this type of quantization is to move the multiplication by the fp32 rescale factor outside of the dot-product accumulation. Since the rescale factor is the same for every element being summed, it can be factored out of the sum. So the multiplications and additions are done in fp8/int8/int4/whatever (when the hardware supports those operations, of course) and accumulated in fp32 or similar, and only the final accumulator is multiplied by the rescale factor in fp32.
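To make that concrete, here is a rough NumPy sketch, not any particular library's kernel. It assumes symmetric per-tensor int8 quantization with one fp32 scale per vector, and uses an int32 accumulator as the "or similar" option. The function names (`quantize_int8`, `dot_rescale_inside`, `dot_rescale_outside`) are made up for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization (assumed scheme): x ~= scale * q."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dot_rescale_inside(qa, sa, qb, sb):
    """Naive version: dequantize every element, so each term needs fp32 multiplies."""
    acc = np.float32(0.0)
    for x, y in zip(qa, qb):
        acc += (sa * np.float32(x)) * (sb * np.float32(y))
    return acc

def dot_rescale_outside(qa, sa, qb, sb):
    """Rescale moved outside: integer multiply-accumulate, one fp32 multiply at the end."""
    acc = np.dot(qa.astype(np.int32), qb.astype(np.int32))  # wide integer accumulator
    return np.float32(acc) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal(64).astype(np.float32)
b = rng.standard_normal(64).astype(np.float32)
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)
print(dot_rescale_inside(qa, sa, qb, sb), dot_rescale_outside(qa, sa, qb, sb))
```

Both functions compute the same quantized dot product up to fp32 rounding. The second one just does the per-element work in integers and applies the combined scale `sa * sb` once.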