by frogblast 1339 days ago
The 2xFP32 solution is also dramatically faster than FP64 on nearly all GPUs.

While most GPUs support FP64, unless you pay for the really high-end scientific computing models, you're typically getting 1/32nd rate compared to FP32 performance. Even your shiny new RTX 4090 runs FP64 at 1/64th rate.

2xFP32 for most basic operations can be 1/4th the rate of FP32. It is quite often the superior solution compared to using the FP64 support provided in GPU languages.
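For anyone who hasn't seen the 2xFP32 trick (also called "float-float" or "double-single"), here is a minimal sketch of the addition case. It is written in plain Python so it runs anywhere. FP32 is emulated by rounding every intermediate result through `struct`, which is my assumption for illustration and not how you would write it on a GPU, where the values would simply be `float`. The helper names (`f32`, `two_sum`, `split`, `ff_add`) are made up for this example.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's TwoSum on FP32 inputs: returns (s, e) with s + e == a + b exactly."""
    s = f32(a + b)                      # rounded FP32 sum
    bb = f32(s - a)                     # the part of b that made it into s
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))  # rounding error of the sum
    return s, e

def split(x: float) -> tuple[float, float]:
    """Represent an FP64 value as an unevaluated (hi, lo) pair of FP32 values."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo

def ff_add(x: tuple[float, float], y: tuple[float, float]) -> tuple[float, float]:
    """Add two (hi, lo) pairs using only FP32-rounded operations."""
    s, e = two_sum(x[0], y[0])          # exact sum of the high parts
    e = f32(e + f32(x[1] + y[1]))       # fold in the low parts
    hi = f32(s + e)                     # renormalize so lo stays small
    lo = f32(e - f32(hi - s))
    return hi, lo
```

With this, `ff_add(split(1.0), split(1e-8))` keeps the `1e-8` in the low word, whereas a single FP32 add of `1.0 + 1e-8` rounds straight back to `1.0`. The add costs several FP32 operations instead of one, so it is slower than native FP32 but can still come out ahead of slow native FP64.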


>While most GPUs support FP64, unless you pay for the really high-end scientific computing models, you're typically getting 1/32nd rate compared to FP32 performance.

I wonder if there is a hardware reason for this or if it's just market segmentation by NVIDIA.

Mostly market segmentation. There is a software lock that caps FP64 performance at a certain ratio (of clock speed) of FP32 performance, and the ratio varies by card. For most consumer NVIDIA cards it is locked to 1/24 of FP32 speed to prevent use in professional settings that require FP64 performance. However, some cards, such as the Radeon VII, are only locked to 1/4 of FP32 speed (much faster).

My naive guess is that most floating-point code uses FP32 and that FP64 takes at least double the die area. So optimize for FP32 and have some FP64 for the rare equations that need it.

These compute units are usually sliced: they can perform either four FP32 multiplies or one FP64 multiply on the same part of the die. This trick dates back at least to the development of PA-RISC; from what I remember, it was HP who introduced the sliced ALU, capable of doing one large operation or several smaller ones on the same hardware.

I could be wrong about who did that first, but most FPUs are now built like that.
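A toy illustration of why four narrow multipliers can stand in for one wide one: a wide product is the sum of four shifted partial products. This is an integer-only sketch with widths (16 and 32 bits) picked for readability. Real FP units multiply mantissas and also handle exponents and rounding, none of which is modeled here. The function names are hypothetical.

```python
MASK16 = 0xFFFF

def mul16(a: int, b: int) -> int:
    """Stand-in for one narrow hardware multiplier: 16x16 -> 32 bits."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b

def mul32_from_slices(a: int, b: int) -> int:
    """32x32 -> 64-bit product built from four 16x16 partial products."""
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16
    return (
        (mul16(a_hi, b_hi) << 32)                         # high x high
        + ((mul16(a_hi, b_lo) + mul16(a_lo, b_hi)) << 16)  # cross terms
        + mul16(a_lo, b_lo)                               # low x low
    )
```

The same four `mul16` slices could instead be fed four independent pairs of narrow operands, which is the "one large or several smaller operations" trade described above.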

On GPUs, they haven't been sliced like this for quite a long time, to save die area.

The slicing was introduced to save die area. Not slicing trades greater die area for a slightly smaller computation delay.