| HN Mirror


	by fxtentacle 942 days ago
	It's actually much faster if you're limited by RAM bandwidth: a float32 x float32 multiply needs 8 bytes of load and 4 bytes of store, whereas an int8 x int8 multiply needs only 2 bytes in and 1 byte out. And typically for a quantized LLM like this, you'd only do the packing and unpacking on the low-dimensional vectors before and after the matmuls, so that you can use the quantized weights directly. E.g. you quantize a 512-float activation to 512 int8 values, then do the matmul with the 512x4096 weights, the GELU, and the matmul with the 4096x512 weights all in int8, then de-quantize back to 512 floats. That means no quantization overhead on the 4,194,304 parameters in your Dense layers.
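
	To make the data flow concrete, here is a rough NumPy sketch of that 512 -> 4096 -> 512 block. Everything in it is an assumption for illustration, not how any particular runtime does it: the weights are random, quantization is symmetric per-tensor with a single float scale, the helper names (`quantize`, `int8_matmul`, `mlp_int8`) are made up, and the GELU is applied in float between the two matmuls and then re-quantized, where a real kernel would more likely stay in integers (e.g. with a lookup table). The point is only that the 4,194,304 weights are quantized once up front, and at inference time only the 512- and 4096-element activation vectors get converted.

```python
import numpy as np

def quantize(x):
    """Symmetric per-tensor quantization: float array -> (int8 array, float scale)."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid dividing by zero for an all-zero tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def int8_matmul(a_q, w_q):
    # Inputs are int8; accumulate in int32 so the sum of products cannot overflow.
    return a_q.astype(np.int32) @ w_q.astype(np.int32)

rng = np.random.default_rng(0)
D, H = 512, 4096

# Hypothetical float weights, quantized ONCE ahead of time.
w1 = (rng.standard_normal((D, H)) / np.sqrt(D)).astype(np.float32)
w2 = (rng.standard_normal((H, D)) / np.sqrt(H)).astype(np.float32)
w1_q, s_w1 = quantize(w1)
w2_q, s_w2 = quantize(w2)

def mlp_int8(x):
    x_q, s_x = quantize(x)                      # 512 floats -> 512 int8
    h = int8_matmul(x_q, w1_q) * (s_x * s_w1)   # rescale the int32 accumulators
    h_q, s_h = quantize(gelu(h))                # simplification: GELU in float, then re-quantize
    y = int8_matmul(h_q, w2_q) * (s_h * s_w2)   # de-quantize back to 512 floats
    return y.astype(np.float32)

def mlp_float(x):
    # float32 reference for comparison
    return gelu(x @ w1) @ w2

x = rng.standard_normal(D).astype(np.float32)
y_q = mlp_int8(x)
y_f = mlp_float(x)

rel_err = float(np.linalg.norm(y_q - y_f) / np.linalg.norm(y_f))
n_params = w1_q.size + w2_q.size
print("parameters:", n_params)
print("weight bytes int8 vs float32:", w1_q.nbytes + w2_q.nbytes, "vs", w1.nbytes + w2.nbytes)
print("relative error vs float32:", rel_err)
```

	The weight-byte line is where the bandwidth argument shows up: the int8 weights take a quarter of the memory traffic of the float32 ones, while the per-inference quantization work touches only the activation vectors.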