| HN Mirror



	by talldayo 537 days ago
	> The M4 Max has 546GB/s of memory bandwidth and ~34TFLOPS (fp16) = ~68 GB/s, a ratio of ~8.02. Whereas NVIDIA RTX 4090 has 1008GB/s memory bandwidth and ~330TFLOPS (fp16) = ~660GB/s, a ratio of ~1.52.

	Why are we comparing FP16 performance when you're running inference on INT4-quantized models? It seems like a misleading figure to compare against when it's not even the performance you're measuring.
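For reference, the quoted ratios can be reproduced with a few lines of arithmetic. This is only a sketch, and it assumes the "= ~68" and "= ~660" figures come from multiplying the FP16 TFLOPS by 2 bytes per FP16 value, which is what the numbers are consistent with. Under that assumption the product is strictly in TB/s rather than GB/s, so the ratio is really "GB/s of bandwidth per TB/s of operand traffic". The function name and the `bytes_per_value` parameter are made up for illustration.

```python
def bandwidth_to_compute_ratio(bandwidth_gb_s, fp16_tflops, bytes_per_value=2):
    """Return (operand traffic figure, bandwidth divided by that figure).

    Assumption: the traffic figure is TFLOPS * bytes per FP16 value,
    i.e. the rate at which values would be consumed if every FLOP
    needed one fresh 2-byte operand from memory.
    """
    operand_traffic = fp16_tflops * bytes_per_value
    return operand_traffic, bandwidth_gb_s / operand_traffic


# Figures as quoted in the comment: (memory bandwidth in GB/s, FP16 TFLOPS).
chips = {"M4 Max": (546, 34), "RTX 4090": (1008, 330)}

for name, (bandwidth, tflops) in chips.items():
    traffic, ratio = bandwidth_to_compute_ratio(bandwidth, tflops)
    print(f"{name}: bandwidth {bandwidth}, operand traffic {traffic}, ratio {ratio:.2f}")
```

The computed values are about 8.03 and 1.53, which match the quoted ~8.02 and ~1.52 up to rounding.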

1 comment

boroboro4 537 days ago

Because INT4-quantized weights still use FP16 compute in most cases. Sometimes it's possible to use FP8/INT8 compute, and there is research on using INT4 compute, but that's rather rare.
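To make the "INT4 weights, FP16 compute" point concrete, here is a toy sketch in pure Python. It is not how any particular inference library does it: the helper names are hypothetical, the quantization is a simple symmetric per-tensor scheme chosen for illustration, and real kernels typically use per-group scales and often accumulate in FP32. The point is only that the weights are stored as packed 4-bit integers (which saves memory bandwidth), but are expanded to FP16 before the multiply-accumulate.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_int4(q):
    """Pack two signed 4-bit values per byte, low nibble first."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4) for i in range(0, len(q), 2))


def unpack_int4(packed, n):
    """Inverse of pack_int4: recover the first n signed 4-bit values."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:n]


def dot_w4a16(packed, scale, n, activations):
    """Dequantize INT4 weights to FP16, then multiply-accumulate in FP16."""
    s = to_fp16(scale)
    acc = 0.0
    for q, a in zip(unpack_int4(packed, n), activations):
        w = to_fp16(q * s)  # 4-bit integer -> FP16 weight
        acc = to_fp16(acc + to_fp16(w * to_fp16(a)))  # FP16 arithmetic
    return acc


weights = [7.0, -3.0, 1.0, 0.0]
activations = [1.0, 2.0, 0.5, 4.0]
q, scale = quantize_int4(weights)
packed = pack_int4(q)  # 4 weights stored in 2 bytes
print(len(packed), dot_w4a16(packed, scale, len(q), activations))
```

In this sketch the storage cost is half a byte per weight, while every multiply and add still happens on FP16 values, which is why FP16 throughput remains the relevant compute figure.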
