| HN Mirror



	by SuchAnonMuchWow 660 days ago
	The goal of this type of quantization is to move the multiplication by the fp32 rescale factor outside of the dot-product accumulation. The multiplications and additions are done in fp8/int8/int4/whatever (when the hardware supports those operations, of course) and accumulated in fp32 or a similarly wide type, and only the final accumulator is multiplied by the rescale factor in fp32.
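
To make that concrete, here is a rough NumPy sketch, not anyone's actual kernel. It assumes symmetric per-tensor int8 quantization with one fp32 scale per vector and an int32 accumulator (one possible "or similar" wide type); the helper names `quantize_int8`, `dot_rescale_inside` and `dot_rescale_outside` are made up for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Hypothetical helper: symmetric per-tensor quantization, x ~= scale * q."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dot_rescale_inside(qa, sa, qb, sb):
    # Rescale every element before multiplying: all the work happens in fp32.
    fa = sa * qa.astype(np.float32)
    fb = sb * qb.astype(np.float32)
    return np.sum(fa * fb, dtype=np.float32)

def dot_rescale_outside(qa, sa, qb, sb):
    # Multiply-accumulate on integers (int32 accumulator assumed here),
    # then apply the combined rescale factor once, in fp32.
    acc = np.dot(qa.astype(np.int32), qb.astype(np.int32))
    return np.float32(acc) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal(1024).astype(np.float32)
b = rng.standard_normal(1024).astype(np.float32)

qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)

inside = dot_rescale_inside(qa, sa, qb, sb)
outside = dot_rescale_outside(qa, sa, qb, sb)
print(inside, outside, np.dot(a, b))
```

Both functions compute the same quantized dot product up to fp32 rounding, because the scales are constant across the sum and factor out of it. The difference is where the fp32 multiplies happen: once per element in the first version, once per dot product in the second. That factoring only works when the scale is shared by everything being accumulated, which is why per-tensor or per-block scales fit this scheme.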