Hacker News
by adrian_b 1683 days ago
Flushing subnormals to zero produces speed gains only on certain CPU models, while on others it has almost no effect.

For example, Zen CPUs have negligible penalties for handling denormals, but many Intel models have a penalty of between 100 and 200 clock cycles for an operation with denormals.

Even on the CPU models with slow denormal processing, a speedup of between 100 and 1000 times exists only for the operation with denormals itself, and only when that operation belongs to a stream of operations running at the maximum SIMD speed of the CPU. In that case, during the hundred-odd lost clock cycles the CPU could otherwise have completed 4 or 8 operations in every clock cycle.

No complete computation can have a significant percentage of operations with denormals, unless it is written in an extremely bad way.

So for a complete computation, even on the models with bad denormal handling, a speedup of more than a few times would be abnormal.
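The argument above can be put into a back-of-the-envelope model. The Python sketch below is an illustration only: the function name `overall_slowdown` and the sample inputs (a 150-cycle penalty, 8 operations per cycle, a 0.1% denormal fraction) are assumptions picked from within the ranges quoted above, not measurements.

```python
def overall_slowdown(denormal_fraction, penalty_cycles, ops_per_cycle):
    """Estimate the whole-program slowdown caused by denormal penalties.

    Assumed model: every operation normally costs 1/ops_per_cycle cycles,
    and each operation that touches a denormal costs penalty_cycles extra.
    """
    normal_cost = 1.0 / ops_per_cycle
    slow_cost = normal_cost + penalty_cycles
    average_cost = ((1.0 - denormal_fraction) * normal_cost
                    + denormal_fraction * slow_cost)
    return average_cost / normal_cost


# One operation in a thousand hits the penalty (assumed inputs).
print(overall_slowdown(0.001, 150, 8))
# Every single operation hits the penalty (the worst case for this model).
print(overall_slowdown(1.0, 150, 8))
```

Under these assumptions the model reduces to 1 + fraction × penalty × ops_per_cycle. A denormal fraction of 0.1% gives a slowdown of about 2.2 times, while a 100 times slowdown would need roughly 8% of all operations to hit the penalty.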

The only controversy that has ever existed about denormals is that handling them at full speed increases the cost of the FPU, so lazy or greedy companies, i.e. mainly Intel, have preferred to add the flush-to-zero option for gamers, instead of designing the FPU in the right way.

When the correctness of the results is not important, as in many graphics or machine-learning applications, using flush-to-zero is OK; otherwise it is not.

When we were debating whether WebAssembly should support subnormal numbers (i.e. be IEEE compliant), some people often cited these mythical subnormal slowdowns. So Dan Gohman ran some benchmarks and the scary-sounding slowdowns amounted to something like less than 1% (i.e. noise) for almost all benchmarks. Interestingly, one benchmark did not converge correctly with FTZ (i.e. no subnormals) and actually ran 3x more iterations, leading to a 3x slowdown.
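The thread does not say why that benchmark failed to converge. One well-known guarantee that flush-to-zero removes is that, with gradual underflow, two distinct finite numbers never have a difference of exactly zero. The Python sketch below emulates FTZ in software with a hypothetical helper, `flush_to_zero`; this is not how hardware FTZ is enabled, and the sample values are chosen only so that the difference lands in the subnormal range.

```python
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal double, about 2.2e-308


def flush_to_zero(x):
    """Software stand-in for FTZ: replace any subnormal value with 0.0."""
    return 0.0 if 0.0 < abs(x) < MIN_NORMAL else x


a = 3.0e-308  # a normal double
b = 2.5e-308  # another normal double, different from a

gradual = a - b                  # IEEE behaviour: a nonzero subnormal
flushed = flush_to_zero(a - b)   # emulated FTZ behaviour: exactly 0.0

print(a != b, gradual != 0.0, flushed != 0.0)
```

With gradual underflow the difference is a tiny nonzero number. With the emulated flush it becomes 0.0 even though `a != b`, so code that relies on a nonzero difference (for example a guard before dividing by `a - b`, or a stopping test based on a difference) can behave differently.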

Outside of a vanishingly few edge cases, I think the subnormal debate is basically over, except, apparently, inside of Intel. Every single other architecture and microarchitecture manages to handle subnormals with relative ease, with a penalty of only a handful of clock cycles. I think Intel hardware should be called out, not programmers who just want the 35-year-old floating point standard to be fast like it is on other chips.

Similar stories happened in the GPU world, and my understanding is that essentially all GPUs are converging on IEEE compliance by default now.

> When we were debating whether WebAssembly should support subnormal numbers (i.e. be IEEE compliant), some people often cited these mythical subnormal slowdowns. So Dan Gohman ran some benchmarks and the scary-sounding slowdowns amounted to something like less than 1% (i.e. noise) for almost all benchmarks. Interestingly, one benchmark did not converge correctly with FTZ (i.e. no subnormals) and actually ran 3x more iterations, leading to a 3x slowdown.

I recently built a modular additive music synthesizer called Flow (https://github.com/eclab/flow). When certain modules in the synthesizer [gradually] push certain state variables into the denormal range, my synthesizer will experience a roughly 100x slowdown. Mind you, this isn't due to DSP or even sound processing, and Flow isn't written in C, but in 100% pure *Java*. Since Java can't turn off denormals, I have to manually check for and zero them at strategic locations to avoid getting mired in the denormal quicksand.

This is strange, so it may be more of a Java problem than a CPU problem.

I do not know how Java handles this, but maybe it actually enables exceptions for underflow which invoke some handler.

Otherwise I cannot see how you can obtain such a huge slowdown, unless your code consists entirely of back-to-back operations with denormals and of nothing else.

I am not sure what you mean by "state variables", but if they are pushed into the denormal range, they should be changed to double, not float.

If you push double variables into the denormal range, then it is likely that the algorithm must be modified, because this should not happen.

Underflows, i.e. denormals, are difficult to avoid when using float variables, which can be mandatory in DSP algorithms for audio or video, but outside the arrays processed with SIMD instructions at maximum speed, the scalar variables can be double, which should never underflow in most correct algorithms.

For computations run on CPUs, not GPUs, there is only very seldom a reason to use a scalar float variable. Normally float should be used only for arrays.
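To illustrate the range argument: a value that is subnormal as a 32-bit float can be comfortably normal as a 64-bit double. Python has no native 32-bit float, so this sketch rounds through `struct` to emulate one; the helper names are made up for illustration.

```python
import struct
import sys

F32_MIN_NORMAL = 2.0 ** -126         # smallest positive normal float32, about 1.18e-38
F64_MIN_NORMAL = sys.float_info.min  # smallest positive normal float64, about 2.2e-308


def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def is_subnormal_f32(x):
    """True if x, once stored as a float32, is a nonzero subnormal."""
    y = abs(to_f32(x))
    return 0.0 < y < F32_MIN_NORMAL


def is_subnormal_f64(x):
    """True if x, stored as a double, is a nonzero subnormal."""
    return 0.0 < abs(x) < F64_MIN_NORMAL


x = 1e-40
print(is_subnormal_f32(x), is_subnormal_f64(x))
```

The same value, 1e-40, is below the float32 normal range but still far above the double normal limit, which is why keeping scalar state in double leaves so much more headroom before underflow.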

> entirely of back-to-back operations with denormals

In the context of sound, I could see this happening with an exponentially decaying envelope generator (or an IIR filter).
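A minimal sketch of that situation, using Python doubles and a hypothetical decay factor of 0.999 (real envelopes and filters will differ). Once the state reaches the subnormal range, multiplying by a factor close to 1 rounds back to the same value, so the state stalls at a tiny nonzero number instead of reaching zero.

```python
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal double


def decay(x, factor, steps):
    """Apply x <- x * factor repeatedly, like a simple decaying envelope."""
    for _ in range(steps):
        x *= factor
    return x


state = decay(1.0, 0.999, 1_000_000)

is_subnormal = 0.0 < state < MIN_NORMAL  # stuck below the normal range
stalled = (state * 0.999 == state)       # one more step changes nothing

print(state, is_subnormal, stalled)
```

From this point on, every further update of `state` is an operation on a subnormal, which is the case that is slow on the CPUs discussed above.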

100% correct. Many audio and synthesis algorithms, including mine, perform LOTS of iterated exponential or high-polynomial decays on variables, such as x <- x * 0.1, or x <- x * x or whatnot. These decays rapidly pull values to the denormal range and keep them there, never hitting zero. Depending on the CPU, this in turn forces everything to go into microcode or software emulation, producing a gigantic slowdown. There are other common cases as well.

The only way to get around this in languages like Java, which cannot flush to zero, is to vigilantly check the values and flush them to zero manually.
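A sketch of that kind of manual check, written in Python for brevity rather than Java. The helper name `flush_denormal` and the choice of the smallest normal double as the threshold are assumptions for illustration; the thread does not show Flow's actual code.

```python
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal double


def flush_denormal(x, threshold=MIN_NORMAL):
    """Return 0.0 for any value whose magnitude is below the threshold."""
    return 0.0 if abs(x) < threshold else x


def decay_with_flush(x, factor, steps):
    """Decaying state variable with a manual flush after every update."""
    for _ in range(steps):
        x = flush_denormal(x * factor)
    return x


print(decay_with_flush(1.0, 0.999, 1_000_000))
```

With the check in place the state becomes exactly 0.0 as soon as it would leave the normal range. In Java the corresponding constants are `Double.MIN_NORMAL` and `Float.MIN_NORMAL`.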

Yes. It is very easy to accidentally produce denormals in recursive audio algorithms.

> unless your code consists entirely of back-to-back operations with denormals and of nothing else.

Ending up with the data entirely in the denormal range is a common occurrence in some audio algorithms (and in that field, Intel CPUs dominate by such a large margin it's not even funny); if that happens at the beginning of your signal processing pipeline, you're in for a rough time.

I agree that this happens, but solving such cases with FTZ is a lazy solution, which is guaranteed to give bad results, due to the loss of precision.

Even when 32-bit floating point is used instead of 16-bit fixed point, for its greater dynamic range and its 24-bit precision, proper DSP algorithm implementations still need some of the techniques that are necessary in fixed-point algorithms, i.e. suitable scale factors must be inserted in various places.

A correct implementation must avoid almost all underflows and overflows, by appropriate scalings.

The problem is (speaking from the end-user side) that you can't guarantee every plug-in you are going to use is coded properly, and you don't want that 2007 plug-in, whose author has been dead for a decade but which is super important for your sound, to bring your whole performance down when it gets silence-ish input.

Complain to Intel. AMD and ARM chips have no such 100x penalties.

Perhaps true. But the point is: you're calling denormal failures "edge cases", yet my primary experience with denormals is exactly this.

I sympathize. Software/hardware is littered with one person's edge cases being another person's entire world. But in the grand scheme, yes, subnormals are exceedingly rare. Clearly Intel microarchitecture designers think so, as they seem perfectly willing to continue punishing some applications with a massive performance cliff. Their mitigation should never have been "we'll add a cheat switch for speed" but rather "we'll work as hard as our competitors do to make these cases fast." Standards are supposed to do that, but cheaters abound (and yes, I am being a bit pejorative; cheaters don't think of themselves as cheating, they merely have important use cases that demand special dispensation).

GPU hardware is a different, but similar story, from what I can see. It saves transistors to do FTZ, and the originally niche usage of FP to put pixels on the screen didn't really care so much about niggling details. But GPUs became general purpose and important, and they've been dragged into full compliance by application demands. It's the only sane outcome in the end. Instead, all this FTZ stuff has just made a mess at layers above. It would all be unnecessary if subnormals were as fast as AMD, ARM, IBM, and other chip manufacturers have managed to make them.

That's a fascinating thread, thanks: https://github.com/WebAssembly/design/issues/148

> The only controversy that has ever existed about denormals is that handling them at full speed increases the cost of the FPU, so lazy or greedy companies, i.e. mainly Intel, have preferred to add the flush-to-zero option for gamers

You could also say some companies have been kind enough to make hardware for gamers that doesn’t have costly features they do not need.

Except CPUs do have those features, they are just slow. FTZ is kind of a cheat mode for extra speed. The problem is that cheats just mushroom into software problems and generally make a crappier, less reliable platform. The situation is rife in computer hardware.