by npn 106 days ago
You're thinking in FP16. Nobody uses FP16 for inference anymore. The 400% is probably for FP4/INT4 computation.
1 comment

Tensor core throughput is inversely proportional to precision (bit width) across all generations, i.e., halving the precision doubles OPS. 8-bit precision will give you the same improvement ratio. A100/H100 didn't support 4-bit, if I remember correctly.

So FP4/INT4 will likely see the same 30% OPS/W improvement. You could get a separate improvement by reducing precision, but going to 1-bit for a 4x improvement feels unlikely for now.
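
To make the arithmetic concrete, here is a rough sketch of the rule of thumb above, assuming OPS scales exactly as 1 / bit width. The function name `relative_ops` is made up for illustration, and real hardware only supports specific formats, so treat the numbers as the idealized ratio and not a measured one.

```python
def relative_ops(base_bits: int, target_bits: int) -> float:
    """Idealized throughput multiplier, assuming OPS scales as 1 / bit width."""
    if base_bits <= 0 or target_bits <= 0:
        raise ValueError("bit widths must be positive")
    return base_bits / target_bits


# Each halving of precision doubles OPS under this assumption.
for base, target in [(16, 8), (16, 4), (8, 4), (4, 1)]:
    print(f"{base}-bit -> {target}-bit: {relative_ops(base, target):.0f}x OPS")
```

Under that assumption, a 4x gain from precision alone takes two halvings: 16-bit to 4-bit, or 4-bit all the way down to 1-bit. That is why a 4x jump on top of hardware that already runs 4-bit looks unlikely.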