| Flushing subnormals to zero produces speed gains only on certain CPU models; on others it has almost no effect. For example, Zen CPUs have negligible penalties for handling denormals, but many Intel models have a penalty of between 100 and 200 clock cycles for an operation with denormals. Even on the CPU models with slow denormal processing, a speedup of 100x to 1000x exists only for the operation with denormals itself, and only when that operation belongs to a stream of operations running at the maximum SIMD speed of the CPU, where during the hundred-odd lost clock cycles the CPU could have completed 4 or 8 operations per clock cycle. No complete computation can have a significant percentage of operations with denormals unless it is written in an extremely bad way. So for a complete computation, even on the models with bad denormal handling, a speedup of more than a few times would be abnormal. The only controversy that has ever existed about denormals is that handling them at full speed increases the cost of the FPU, so lazy or greedy companies, i.e. mainly Intel, have preferred to add the flush-to-zero option for gamers instead of designing the FPU the right way. When the correctness of the results is not important, as in many graphics or machine-learning applications, using flush-to-zero is OK; otherwise it is not. |
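To make the correctness point concrete, here is a minimal Python sketch of what is lost when subnormal results are flushed. It is only an illustration under stated assumptions: `is_subnormal` and `flush_to_zero` are hypothetical helpers that emulate flush-to-zero in software (they do not touch any CPU control register), and the interpreter is assumed to run with the default IEEE 754 double behavior.

```python
import math
import sys

TINY = sys.float_info.min  # smallest positive normal double, 2**-1022


def is_subnormal(v: float) -> bool:
    """True for nonzero values below the smallest normal magnitude."""
    return v != 0.0 and abs(v) < TINY


def flush_to_zero(v: float) -> float:
    """Software emulation of FTZ (hypothetical helper, not a hardware mode):
    a subnormal result is replaced by a zero of the same sign."""
    return math.copysign(0.0, v) if is_subnormal(v) else v


x = 1.5 * TINY
y = 1.0 * TINY

gradual = x - y                 # exactly 2**-1023, a subnormal
flushed = flush_to_zero(x - y)  # what a flushing FPU would hand back

print(x != y, gradual != 0.0)   # with gradual underflow, x != y implies x - y != 0
print(x != y, flushed != 0.0)   # with flushing, the difference vanishes

# A guard such as "if x != y" no longer protects a division by (x - y).
print(1.0 / gradual)            # finite
try:
    print(1.0 / flushed)
except ZeroDivisionError:
    print("division by zero under emulated flush-to-zero")
```

The guarantee that two distinct floating-point numbers have a nonzero difference is the kind of property that gradual underflow preserves and that flush-to-zero silently gives up.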
Outside of a vanishingly small number of edge cases, I think the subnormal debate is basically over, except, apparently, inside Intel. Every other architecture and microarchitecture manages to handle subnormals with relative ease, with a penalty of only a handful of clock cycles. I think Intel hardware should be called out, not programmers who just want the 35-year-old floating-point standard to be fast like it is on other chips.
A similar story played out in the GPU world, and my understanding is that essentially all GPUs are now converging on IEEE compliance by default.