by dahart 930 days ago
Still true that fp64 throughput is lower for consumer GPUs - both NV and AMD. That’s kinda why I was curious about leading with that metric - outside of HPC and scientific applications, a lot of people don’t really need fp64, and the machine might normally have a much higher fp32 throughput.

> knowing you can drop to high precision when necessary without penalty is nice.

I guess I maybe don’t know why you’d ever have 1:1 fp32 and fp64 perf. Aren’t the fp64 multipliers (for example) basically 4x fp32 multipliers? I am under the possibly naive impression that if you have all the transistors for 1 fp64 core, that you’d end up with all the transistors you need for 2 or 4 fp32 cores. Maybe that’s not true today, but there does have to be at least 2x the transistors overall for 64-bit vs 32-bit, and lots of those should be shared or reusable, no? It doesn’t seem quite right to frame naturally higher 32-bit op throughput as a “penalty” on 64-bit ops. You’re asking the hardware to do more with 64, and it makes complete sense that given the exact same budget for bandwidth, energy, memory, compute, etc. that 32-bit ops would go faster, no? If the op throughput of fp64 and fp32 is the same, doesn’t that possibly imply that the fp32 ops are potentially being wasted / penalized, just for the sake of having matching numbers?
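
For what it's worth, here is a back-of-the-envelope check of the "basically 4x" intuition. It is only a sketch under a crude assumption: that multiplier cost scales with the number of partial-product bits in a naive n x n array multiplier, where n is the significand width (24 bits for fp32, 53 for fp64, both counting the implicit leading 1). Real hardware uses cleverer designs and shares logic between the two widths, so the helper names and the cost model are illustrative, not a description of any actual GPU.

```python
import sys

# Significand widths in bits, including the implicit leading 1 (IEEE 754).
FP32_SIGNIFICAND_BITS = 24                       # binary32
FP64_SIGNIFICAND_BITS = sys.float_info.mant_dig  # binary64: 53 on IEEE 754 platforms


def partial_product_bits(n: int) -> int:
    """Partial-product bits in a naive n x n array multiplier (assumed cost model)."""
    return n * n


def multiplier_cost_ratio(wide: int, narrow: int) -> float:
    """How many times more partial-product bits the wide multiplier needs."""
    return partial_product_bits(wide) / partial_product_bits(narrow)


ratio = multiplier_cost_ratio(FP64_SIGNIFICAND_BITS, FP32_SIGNIFICAND_BITS)
print(f"fp64 vs fp32 multiplier cost under this model: {ratio:.2f}x")
```

Under that model the ratio comes out a bit under 5x, so "4x fp32 multipliers" is in the right ballpark for the multiplier alone, even though the operand width only doubles.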

2 comments

This is also related to "fast" versions of some operations. You might want the full 32-bit float but you don't want or need to do full-precision division or sqrt operations. This is common in games/graphics and probably machine learning.
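
To make the trade-off concrete, here is the well-known bit-trick approximation of 1/sqrt(x) with one Newton-Raphson refinement step, sketched in Python. It is an illustration of trading accuracy for cheaper work, not how any particular GPU implements its fast paths (CUDA, for instance, exposes hardware approximations such as `rsqrtf` and `__fdividef`). The function name `fast_rsqrt` is made up for this example, and the refinement step runs in Python's double precision rather than in fp32.

```python
import math
import struct


def fast_rsqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive, normal x using the classic bit trick."""
    # Reinterpret the float32 bit pattern of x as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Magic-constant initial guess: roughly halves and negates the exponent.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer back as a float32 value.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step to tighten the estimate.
    return y * (1.5 - 0.5 * x * y * y)


for x in (0.25, 1.0, 2.0, 10.0, 12345.0):
    exact = 1.0 / math.sqrt(x)
    approx = fast_rsqrt(x)
    rel_err = abs(approx - exact) / exact
    print(f"x={x:>8}: approx={approx:.6f} exact={exact:.6f} rel_err={rel_err:.2e}")
```

The relative error stays within a fraction of a percent, which is often fine for things like normalizing vectors in graphics, and nowhere near the accuracy of a correctly rounded sqrt followed by a divide.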
You're right -- I have no idea why fp64 wouldn't be half the speed of fp32, and traditionally it is. I was simply taking them at their word. Maybe they're exaggerating or maybe they did what you suggest and hamstrung fp32.