by SeanLuke 1677 days ago
The other examples he gave trade off significant math deficiencies for small speed gains. But flushing subnormals to zero can produce a MASSIVE speed gain. Like 1000x. And including subnormals isn't necessarily good floating point practice -- they were rather controversial during the development of IEEE 754 as I understand it. The tradeoff here is markedly different than in the other cases.

Flushing subnormals to zero produces speed gains only on certain CPU models; on others it has almost no effect.

For example, Zen CPUs have negligible penalties for handling denormals, but many Intel models have a penalty of between 100 and 200 clock cycles for an operation with denormals.

Even on the CPU models with slow denormal processing, a speedup between 100x and 1000x exists only for the denormal operation itself, and only when that operation belongs to a stream of operations running at the maximum CPU SIMD speed, where during the hundred-odd lost clock cycles the CPU could otherwise have completed 4 or 8 operations per clock cycle.

No complete computation can have a significant percentage of operations with denormals, unless it is written in an extremely bad way.

So for a complete computation, even on the models with bad denormal handling, a speedup of more than a few times would be abnormal.
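
To put rough numbers on the argument above, here is a back-of-the-envelope sketch in Python. The function name `overall_slowdown` and the cost model are my own simplifying assumptions (a normal operation costs 1/ops_per_cycle cycles, a denormal operation stalls for the full penalty); they are not measurements of any particular CPU.

```python
def overall_slowdown(denormal_fraction, penalty_cycles, ops_per_cycle):
    """Estimate the whole-program slowdown caused by denormal operations.

    Hypothetical cost model (an assumption, not a measurement):
      - a normal operation costs 1 / ops_per_cycle clock cycles
      - a denormal operation costs penalty_cycles clock cycles
    """
    normal_cost = 1.0 / ops_per_cycle
    average_cost = ((1.0 - denormal_fraction) * normal_cost
                    + denormal_fraction * penalty_cycles)
    return average_cost / normal_cost


# Worst case for a single operation: 150-cycle penalty, 4 ops per cycle.
per_op = overall_slowdown(1.0, 150, 4)
# One operation in a thousand touches a denormal.
rare = overall_slowdown(0.001, 150, 4)
# One operation in a hundred touches a denormal.
frequent = overall_slowdown(0.01, 150, 4)
```

Under these assumed numbers the per-operation slowdown is 600x, but a program in which one operation in a thousand hits a denormal slows down by roughly 1.6x, and one in a hundred by roughly 7x.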

The only controversy that has ever existed about denormals is that handling them at full speed increases the cost of the FPU, so lazy or greedy companies, i.e. mainly Intel, have preferred to add the flush-to-zero option for gamers, instead of designing the FPU in the right way.

When the correctness of the results is not important, as in many graphics or machine-learning applications, using flush-to-zero is OK; otherwise it is not.

When we were debating whether WebAssembly should support subnormal numbers (i.e. be IEEE compliant), some people often cited these mythical subnormal slowdowns. So Dan Gohman ran some benchmarks and the scary-sounding slowdowns amounted to something like less than 1% (i.e. noise) for almost all benchmarks. Interestingly, one benchmark did not converge correctly with FTZ (i.e. no subnormals) and actually ran 3x more iterations, leading to a 3x slowdown.

Outside of a vanishingly few edge cases, I think the subnormal debate is basically over, except, apparently, inside of Intel. Every single other architecture and microarchitecture manages to handle subnormals with relative ease, with a penalty of only a handful of clock cycles. I think Intel hardware should be called out, not programmers who just want the 35 year old floating point standard to be fast like it is on other chips.

Similar stories happened in the GPU world, and my understanding is that essentially all GPUs are converging on IEEE compliance by default now.

> When we were debating whether WebAssembly should support subnormal numbers (i.e. be IEEE compliant), some people often cited these mythical subnormal slowdowns. So Dan Gohman ran some benchmarks and the scary-sounding slowdowns amounted to something like less than 1% (i.e. noise) for almost all benchmarks. Interestingly, one benchmark did not converge correctly with FTZ (i.e. no subnormals) and actually ran 3x more iterations, leading to a 3x slowdown.

I recently built a modular additive music synthesizer called Flow (https://github.com/eclab/flow). When certain modules in the synthesizer [gradually] push certain state variables into the denormal range, my synthesizer will experience a roughly 100x slowdown. Mind you, this isn't due to DSP or even sound processing, and Flow isn't written in C, but in 100% pure *Java*. Since Java can't turn off denormals, I have to manually check for and zero them at strategic locations to avoid getting mired in the denormal quicksand.

This is strange, so it is likely more of a Java problem than a CPU problem.

I do not know how Java handles this, but maybe it actually enables exceptions for underflow which invoke some handler.

Otherwise I cannot see how you can obtain such a huge slowdown, unless your code consists entirely of back-to-back operations with denormals and of nothing else.

I am not sure what you mean by "state variables", but if they are pushed into the denormal range, they should be changed to double, not float.

If you push double variables into the denormal range, then it is likely that the algorithm must be modified, because this should not happen.

Underflows, i.e. denormals, are difficult to avoid when using float variables, which can be mandatory in DSP algorithms for audio or video, but outside the arrays processed with SIMD instructions at maximum speed, the scalar variables can be double, which should never underflow in most correct algorithms.
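
As a rough illustration of the float versus double point, here is a small Python sketch. The helper names and the example value `1e-40` are mine, chosen only for illustration. It uses the standard `struct` module to round a value to single precision and checks whether it lands in the subnormal range of each format.

```python
import struct
import sys

FLOAT32_MIN_NORMAL = 2.0 ** -126  # smallest normal single-precision value


def to_float32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def is_subnormal_float32(x):
    """True if x, stored as a float32, is nonzero but below the normal range."""
    y = abs(to_float32(x))
    return 0.0 < y < FLOAT32_MIN_NORMAL


def is_subnormal_double(x):
    """True if x, stored as a double, is nonzero but below the normal range."""
    return 0.0 < abs(x) < sys.float_info.min


tiny = 1e-40  # arbitrary example value, e.g. a heavily decayed state variable
in_float = is_subnormal_float32(tiny)   # subnormal when stored as float
in_double = is_subnormal_double(tiny)   # still a normal number as double
```

The same value is a denormal as a float but an ordinary normal number as a double, which is why keeping scalar state in double sidesteps the problem in many cases.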

For computations run on CPUs, not GPUs, there is only very seldom a reason to use a scalar float variable. Normally float should be used only for arrays.

> entirely of back-to-back operations with denormals

In the context of sound, I could see this happening with an exponentially decaying envelope generator (or an IIR filter).

100% correct. Many audio and synthesis algorithms, including mine, perform LOTS of iterated exponential or high-polynomial decays on variables, such as x <- x * 0.1, or x <- x * x or whatnot. These decays rapidly pull values to the denormal range and keep them there, never hitting zero. Depending on the CPU, this in turn forces everything to go into microcode or software emulation, producing a gigantic slowdown. There are other common cases as well.
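
A minimal Python sketch of this effect, assuming a decay factor of 0.999 (my choice for illustration, not a value from any real synthesizer) and IEEE doubles with round-to-nearest, which is what Python floats are on mainstream platforms. Once the value is a small enough subnormal, the multiplication rounds back to the same value, so the decay stalls above zero.

```python
import sys

DECAY = 0.999  # hypothetical per-sample decay factor

x = 1.0
for _ in range(1_000_000):
    x *= DECAY

# In exact arithmetic x would be astronomically small by now; in floating
# point it is parked in the subnormal range instead of reaching zero.
stuck_in_subnormal_range = 0.0 < x < sys.float_info.min

# One more decay step rounds straight back to the same subnormal value.
no_longer_decaying = (x * DECAY == x)
```

Every later operation on `x` is therefore an operation on a denormal, which is where the slow path on affected CPUs comes in.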

The only way to get around this in languages like Java, which cannot flush to zero, is to vigilantly check the values and flush them to zero manually.
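
Roughly what such a manual check looks like, sketched in Python rather than Java to keep it short. The names `flush_denormal` and `decay_with_flush` are hypothetical, and the 0.999 factor is an arbitrary example. In Java the comparison would presumably be against `Double.MIN_NORMAL` instead of `sys.float_info.min`.

```python
import sys


def flush_denormal(x):
    """Return 0.0 if x is subnormal, otherwise return x unchanged.

    Note: this sketch returns +0.0 even for negative subnormals.
    """
    if 0.0 < abs(x) < sys.float_info.min:
        return 0.0
    return x


def decay_with_flush(x, factor, steps):
    """Apply an iterated decay, zeroing the state once it goes subnormal."""
    for _ in range(steps):
        x = flush_denormal(x * factor)
    return x


final = decay_with_flush(1.0, 0.999, 1_000_000)
```

With the check in place the state reaches exactly zero instead of lingering in the subnormal range. The cost is an extra comparison at each place where the check is applied, which is why it is only worth doing at strategic locations.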

Yes. It is very easy to accidentally produce denormals in recursive audio algorithms.
> unless your code consists entirely of back-to-back operations with denormals and of nothing else.

Ending up with the data entirely in the denormal range is a common occurrence in some audio algorithms (and in that field, Intel CPUs dominate by such a large margin it's not even funny); if that happens at the beginning of your signal processing pipeline, you're in for a rough time.

I agree that this happens, but solving such cases with FTZ is a lazy solution, which is guaranteed to give bad results, due to the loss of precision.

Even when 32-bit floating point is used, for a greater dynamic range and for a 24-bit precision, instead of using 16-bit fixed-point numbers, proper DSP algorithm implementations still need to use some of the techniques that are necessary with fixed-point number algorithms, i.e. suitable scale factors must be inserted in various places.

A correct implementation must avoid almost all underflows and overflows, by appropriate scalings.

Complain to Intel. AMD and ARM chips have no such 100x penalties.
Perhaps true. But the point is: you're calling denormal failures "edge cases", yet my primary experience with denormals is exactly this.
I sympathize. Software/hardware is littered with one person's edge cases being another person's entire world. But in the grand scheme, yes, subnormals are exceedingly rare. Clearly Intel microarchitecture designers think that, as they seem perfectly willing to continue punishing some applications with a massive performance cliff. Their mitigation should never have been "we'll add a cheat switch for speed" but rather "we'll work as hard as our competitors do to make these cases fast." Standards are supposed to do that, but cheaters abound (and yes I am being a bit pejorative--cheaters don't think of themselves as cheating, they merely have important use cases that demand special dispensation).

GPU hardware is a different, but similar story, from what I can see. It saves transistors to do FTZ, and the originally niche usage of FP to put pixels on the screen didn't really care so much about niggling details. But GPUs became general purpose and important, and they've been dragged into full compliance by application demands. It's the only sane outcome in the end. Instead, all this FTZ stuff has just made a mess at layers above. It would all be unnecessary if subnormals were as fast as AMD, ARM, IBM, and other chip manufacturers have managed to make them.

That's a fascinating thread, thanks: https://github.com/WebAssembly/design/issues/148
> The only controversy that has ever existed about denormals is that handling them at full speed increases the cost of the FPU, so lazy or greedy companies, i.e. mainly Intel, have preferred to add the flush-to-zero option for gamers

You could also say some companies have been kind enough to make hardware for gamers that doesn’t have costly features they do not need.

Except CPUs do have those features, they are just slow. FTZ is kind of a cheat mode for extra speed. The problem is that cheats just mushroom into software problems and generally make a crappier, less reliable platform. The situation is rife in computer hardware.
That's true, but the danger of flushing subnormals to zero is correspondingly worse because it's global CPU state and there's commonly used code that relies on not flushing subnormals to zero in order to work correctly, like `libm`. The example linked in the post is of a case where loading a shared library that had been compiled with `-Ofast` (which includes `-ffast-math`) broke a completely unrelated package because of this. Of course, the fact that CPU designers made this a global hardware flag is atrocious, but they did, so here we are.
Wait, what is "local" CPU state/hardware flag? In any case, since x64 ABI doesn't require MXCSR to be in any particular state on function entry, libm should set/clear whatever control flags it needs on its own (and restore them on exit since MXCSR control bits are defined to be callee-saved).
Local would be not using register flags at all and instead indicating with each operation whether you want flushing or not (and rounding mode, ideally). Some libms may clear and restore the control flags and some may not. Libm is just an example here, and one where you're right that most of the function calls that might need to avoid flushing subnormals to zero are expensive enough that clearing and restoring flags is an acceptable cost. However, that's not always the case: sometimes the operation in question is a few instructions and it may get inlined into some other code. It might be possible to handle this better at the compiler level while still using the MXCSR register, but if it is, LLVM certainly can't currently do that well.
In theory, every function should do that to check things like rounding mode etc. But that would be pretty slow, especially for low-latency operations (modifying mxcsr will disrupt pipelining for example).
That wouldn't be practical. C math library performance really matters for numerical-intensive apps like games.