Hacker News new | ask | show | jobs
by talldayo 537 days ago
> The M4 Max has 546GB/s of memory bandwidth and ~34TFLOPS (fp16) = ~68 GB/s, a ratio of ~8.02. Whereas NVIDIA RTX 4090 has 1008GB/s memory bandwidth and ~330TFLOPS (fp16) = ~660GB/s, a ratio of ~1.52.

Why are we comparing FP16 performance when you're inferencing INT4 quantized models? Seems like a misleading figure to compare with when it's not really even the performance you're measuring.

1 comments

Because INT4 quantized weights still use FP16 compute in most cases. Sometimes it's possible to use FP8/INT8 compute, and there is research to use INT4 compute, but it's rather rare.