| HN Mirror

Tensor core performance is inversely proportional to precision across all generations (i.e., reducing precision by a factor of 2 increases OPS by a factor of 2). 8-bit precision will give you the same improvement ratio. A100/H100 didn't support 4-bit if I remember correctly.

So FP4/INT4 will likely improve the same 30% OPS/W. You could get a separate improvement by reducing precision, but going 1-bit for 4x improvement feels unlikely for now.