by JonChesterfield 1475 days ago
fp32 uses much less silicon and power than fp64. I think the cost scales roughly quadratically with bit width on both counts, so halving the width gets you about 4x the performance for the same silicon and power budget.

I vaguely remember a consumer card having 1/4 the fp64 units of a similar data center one, so combined with the 4x above that would get you the 16x on paper.
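Spelling that arithmetic out as a rough sketch: the quadratic exponent and the 1/4 unit ratio are the guesses stated above, not measured figures, and `relative_cost` is just a name made up for illustration.

```python
def relative_cost(bits, exponent=2.0):
    """Relative silicon/power cost of one FP unit, assuming cost ~ bits**exponent."""
    return bits ** exponent

# Assumption: cost scales roughly quadratically with bit width.
per_unit_gain = relative_cost(64) / relative_cost(32)

# Assumption: the consumer card has 1/4 the fp64 units of the data center part.
fp64_unit_fraction = 1 / 4
unit_count_gain = 1 / fp64_unit_fraction

# The two factors multiply to give the on-paper fp32 vs fp64 ratio.
on_paper_speedup = per_unit_gain * unit_count_gain
print(per_unit_gain, unit_count_gain, on_paper_speedup)
```

Changing `exponent` shows how sensitive the estimate is to the quadratic guess: with linear scaling the per-unit factor drops to 2x and the total to 8x.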

Memory bandwidth and register file size would suggest another 2x from moving half as much data. My working heuristic on these things is that compute is free, because I fail to even saturate the memory bus, but no doubt some applications are genuinely compute-bound and do run into the fp64 slowdown in practice. Matrix multiply probably is one of them.
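A roofline-style back-of-envelope makes the memory-bound versus compute-bound split concrete. This is only a sketch: the peak and bandwidth numbers are hypothetical placeholders (chosen so fp32 peak is 16x fp64 peak), not the specs of any real card, and the matmul byte count assumes ideal caching where each matrix element crosses the memory bus once.

```python
def attainable_flops(peak_flops, bandwidth_bytes, flops, bytes_moved):
    """Roofline bound: limited by either peak compute or bandwidth * intensity."""
    intensity = flops / bytes_moved  # flops per byte moved
    return min(peak_flops, bandwidth_bytes * intensity)

# Hypothetical machine, for illustration only.
PEAK = {"fp64": 1e12, "fp32": 16e12}   # flop/s
BANDWIDTH = 5e11                       # bytes/s
SIZE = {"fp64": 8, "fp32": 4}          # bytes per element

def axpy(n, dtype):
    # y = a*x + y: 2n flops, read x and y, write y -> 3n elements moved.
    return attainable_flops(PEAK[dtype], BANDWIDTH, 2 * n, 3 * n * SIZE[dtype])

def matmul(n, dtype):
    # C = A @ B: 2n^3 flops, 3n^2 elements moved (ideal caching assumed).
    return attainable_flops(PEAK[dtype], BANDWIDTH, 2 * n**3, 3 * n**2 * SIZE[dtype])

axpy_speedup = axpy(1_000_000, "fp32") / axpy(1_000_000, "fp64")
matmul_speedup = matmul(4096, "fp32") / matmul(4096, "fp64")
print(axpy_speedup, matmul_speedup)
```

Under these assumptions the streaming kernel only sees the 2x from smaller elements, while the large matmul hits the compute roof in both precisions and sees the full on-paper ratio.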