Hacker News
by queuebert 930 days ago
In the old days, depending on architecture, fp64 performance could be atrocious even when fp32 was decent, so bragging about fp64 performance has an authenticity to it. Not all scientific computing requires 64 bits, but knowing that you can drop to high precision when necessary without penalty is nice.
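
A quick way to see why some scientific codes need the extra bits: float32 carries a 24-bit significand and float64 a 53-bit one, so float32 can no longer represent every integer once values pass 2**24. The sketch below emulates float32 rounding in plain Python through the `struct` module. It is an illustration of the number formats, not of how any particular GPU computes.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (IEEE binary64) to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def add_f32(a: float, b: float) -> float:
    """Emulated single-precision add. For these inputs the binary64 sum is
    exact, so rounding it once to binary32 gives the float32 result."""
    return to_f32(to_f32(a) + to_f32(b))

big = 2.0 ** 24  # 16,777,216: beyond this, float32 cannot represent every integer

print(add_f32(big, 1.0) - big)  # 2**24 + 1 rounds back to 2**24 in float32
print((big + 1.0) - big)        # Python floats are float64, so this stays exact
```

Adding 1 is lost entirely in float32 but kept in float64. Accumulations over many small terms hit the same effect long before anything overflows.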

Also, back in the day, integer ops were just called 'ops', grumble grumble. But yeah FLOPS specifically refers to floating point. Calling them TOPS doesn't make sense to me, since tensor cores were meant for matrix operation speedup, and these matrices are rarely integer.

Still true that fp64 throughput is lower for consumer GPUs - both NV and AMD. That’s kinda why I was curious about leading with that metric - outside of HPC and scientific applications, a lot of people don’t really need fp64, and the machine might normally have a much higher fp32 throughput.

> knowing you can drop to high precision when necessary without penalty is nice.

I guess I don’t know why you’d ever have 1:1 fp32 and fp64 perf. Aren’t the fp64 multipliers (for example) basically 4x fp32 multipliers? I am under the possibly naive impression that if you have all the transistors for one fp64 core, you’d end up with all the transistors you need for 2 or 4 fp32 cores. Maybe that’s not true today, but there have to be at least 2x the transistors overall for 64-bit vs 32-bit, and lots of those should be shared or reusable, no?

It doesn’t seem quite right to frame naturally higher 32-bit op throughput as a “penalty” on 64-bit ops. You’re asking the hardware to do more with 64, so given the exact same budget for bandwidth, energy, memory, compute, etc., it makes complete sense that 32-bit ops would go faster, no? If the op throughput of fp64 and fp32 is the same, doesn’t that imply the fp32 ops are being wasted / penalized just for the sake of having matching numbers?
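
The "basically 4x" guess can be sanity-checked with a first-order model. Assume (a simplification, not a description of any real chip) that the significand multiplier dominates and that its size grows with the square of its width, as in a plain array multiplier. Exponent logic, rounding, and tricks such as Booth encoding or Wallace trees change the constants.

```python
# First-order model (an assumption): an n-bit x n-bit array multiplier needs
# about n * n partial-product bits, so its size scales with n squared.
FP32_SIGNIFICAND_BITS = 24  # 23 stored fraction bits + 1 implicit leading bit
FP64_SIGNIFICAND_BITS = 53  # 52 stored fraction bits + 1 implicit leading bit

def multiplier_size_ratio(wide_bits: int, narrow_bits: int) -> float:
    """Relative partial-product count of a wide vs. a narrow square multiplier."""
    return (wide_bits / narrow_bits) ** 2

ratio = multiplier_size_ratio(FP64_SIGNIFICAND_BITS, FP32_SIGNIFICAND_BITS)
print(f"fp64/fp32 significand multiplier size ratio ~ {ratio:.2f}")
```

Under this model the ratio comes out a little under 5, which is in the same ballpark as the 4x guess.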

This is also related to "fast" versions of some operations. You might want the full 32-bit float but not want or need full-precision division or sqrt operations. This is common in games/graphics and probably machine learning.

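
A well-known example of trading precision for speed in graphics is the approximate reciprocal square root popularized by the Quake III Arena source: reinterpret the float32 bits as an integer, subtract half of them from a magic constant, then refine with one Newton-Raphson step. The sketch below emulates it in Python via `struct` (an illustration only; modern hardware uses its own approximate instructions). It lands within about 0.2% of 1/sqrt(x) using no division or sqrt.

```python
import math
import struct

def fast_rsqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for a positive, normal float32-range x using the
    magic-constant bit trick plus one Newton-Raphson step."""
    # Reinterpret the float32 bit pattern of x as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Shifting right roughly halves the exponent; subtracting from the magic
    # constant negates it, giving a crude first guess at x ** -0.5.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step on f(y) = 1/y**2 - x sharpens the guess.
    # (Python evaluates this step in float64; shader code would use float32.)
    return y * (1.5 - 0.5 * x * y * y)

for x in (0.25, 1.0, 2.0, 10.0, 12345.6):
    exact = 1.0 / math.sqrt(x)
    approx = fast_rsqrt(x)
    print(f"x={x:<8} exact={exact:.6f} fast={approx:.6f} "
          f"rel_err={abs(approx - exact) / exact:.2e}")
```
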
You're right -- I have no idea why fp64 wouldn't be half the speed of fp32, and traditionally it is. I was simply taking them at their word. Maybe they're exaggerating or maybe they did what you suggest and hamstrung fp32.
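
The traditional 2:1 ratio falls out of packing. If both precisions share the same vector register width and the same number of FMA pipes, each register holds twice as many fp32 values as fp64 values. A back-of-the-envelope sketch with made-up hardware parameters (hypothetical, not any real product), counting one fused multiply-add as 2 FLOPs:

```python
def peak_gflops(cores: int, clock_ghz: float, vector_bits: int,
                element_bits: int, fma_units_per_core: int) -> float:
    """Peak GFLOP/s for a simple SIMD design, counting one FMA as 2 FLOPs."""
    lanes = vector_bits // element_bits  # elements per vector register
    return cores * clock_ghz * lanes * fma_units_per_core * 2

# Hypothetical chip: 8 cores, 3.0 GHz, 256-bit vectors, 2 FMA units per core.
fp32 = peak_gflops(8, 3.0, 256, 32, 2)
fp64 = peak_gflops(8, 3.0, 256, 64, 2)
print(fp32, fp64, fp32 / fp64)
```

Reported ratios far from 2:1 in either direction mean the hardware is not sharing its datapaths the way this model assumes.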