by titzer 1683 days ago

When we were debating whether WebAssembly should support subnormal numbers (i.e. be IEEE compliant), some people often cited these mythical subnormal slowdowns. So Dan Gohman ran some benchmarks and the scary-sounding slowdowns amounted to something like less than 1% (i.e. noise) for almost all benchmarks. Interestingly, one benchmark did not converge correctly with FTZ (i.e. no subnormals) and actually ran 3x more iterations, leading to a 3x slowdown.

Outside of a vanishingly small number of edge cases, I think the subnormal debate is basically over, except, apparently, inside of Intel. Every other architecture and microarchitecture manages to handle subnormals with relative ease, with a penalty of only a handful of clock cycles. I think Intel hardware should be called out, not programmers who just want the 35-year-old floating point standard to be fast like it is on other chips.

Similar stories happened in the GPU world, and my understanding is that essentially all GPUs are converging on IEEE compliance by default now.

2 comments

SeanLuke 1683 days ago

> When we were debating whether WebAssembly should support subnormal numbers (i.e. be IEEE compliant), some people often cited these mythical subnormal slowdowns. So Dan Gohman ran some benchmarks and the scary-sounding slowdowns amounted to something like less than 1% (i.e. noise) for almost all benchmarks. Interestingly, one benchmark did not converge correctly with FTZ (i.e. no subnormals) and actually ran 3x more iterations, leading to a 3x slowdown.

I recently built a modular additive music synthesizer called Flow (https://github.com/eclab/flow). When certain modules in the synthesizer [gradually] push certain state variables into the denormal range, my synthesizer will experience a roughly 100x slowdown. Mind you, this isn't due to DSP or even sound processing, and Flow isn't written in C, but in 100% pure *Java*. Since Java can't turn off denormals, I have to manually check for and zero them at strategic locations to avoid getting mired in the denormal quicksand.

adrian_b 1683 days ago

This is strange, so it is likely more of a Java problem than a CPU problem.

I do not know how Java handles this, but maybe it actually enables exceptions for underflow which invoke some handler.

Otherwise I cannot see how you can obtain such a huge slowdown, unless your code consists entirely of back-to-back operations with denormals and of nothing else.

I am not sure what you mean by "state variables", but if they are pushed into the denormal range, they should be changed to double, not float.

If you push double variables into the denormal range, then it is likely that the algorithm must be modified, because this should not happen.

Underflows, i.e. denormals, are difficult to avoid when using float variables, which can be mandatory in DSP algorithms for audio or video, but outside the arrays processed with SIMD instructions at maximum speed, the scalar variables can be double, which should never underflow in most correct algorithms.

For computations run on CPUs, not GPUs, there is only very seldom a reason to use a scalar float variable. Normally float should be used only for arrays.
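
To make the float versus double point concrete, here is a minimal Java sketch. The class and method names are hypothetical, chosen only for illustration. It shows that a magnitude such as 1e-40 is subnormal when stored as a float but an ordinary normal number when stored as a double.

```java
// Illustrative sketch only; FloatVsDouble and its methods are hypothetical names.
final class FloatVsDouble {
    /** True if x is nonzero but below the smallest normal float (about 1.18e-38). */
    static boolean isSubnormalFloat(float x) {
        return x != 0.0f && Math.abs(x) < Float.MIN_NORMAL;
    }

    /** True if x is nonzero but below the smallest normal double (about 2.23e-308). */
    static boolean isSubnormalDouble(double x) {
        return x != 0.0 && Math.abs(x) < Double.MIN_NORMAL;
    }

    public static void main(String[] args) {
        float f = 1e-40f;   // representable only as a subnormal float
        double d = 1e-40;   // comfortably inside the normal double range
        System.out.println("float  1e-40 subnormal? " + isSubnormalFloat(f));
        System.out.println("double 1e-40 subnormal? " + isSubnormalDouble(d));
    }
}
```

Widening the scalar state to double moves the underflow threshold from roughly 1e-38 down to roughly 1e-308, which is the headroom the comment above is relying on.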

nitrogen 1683 days ago

> entirely of back-to-back operations with denormals

In the context of sound, I could see this happening with an exponentially decaying envelope generator (or an IIR filter).

SeanLuke 1681 days ago

100% correct. Many audio and synthesis algorithms, including mine, perform LOTS of iterated exponential or high-polynomial decays on variables, such as x <- x * 0.1, or x <- x * x or whatnot. These decays rapidly pull values to the denormal range and keep them there, never hitting zero. Depending on the CPU, this in turn forces everything to go into microcode or software emulation, producing a gigantic slowdown. There are other common cases as well.
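
A minimal Java sketch of this kind of decay follows. The class name, the factor 0.99, and the step count are assumptions picked for illustration, not values taken from Flow. With the default round-to-nearest mode, any factor above 0.5 cannot round the smallest subnormal down to zero, so the value parks in the subnormal range indefinitely.

```java
// Illustrative sketch only; DecayDemo and its parameters are hypothetical.
final class DecayDemo {
    /** Applies x <- x * factor the given number of times and returns the result. */
    static double decay(double x, double factor, int steps) {
        for (int i = 0; i < steps; i++) {
            x *= factor;
        }
        return x;
    }

    /** True if x is nonzero but smaller in magnitude than the smallest normal double. */
    static boolean isSubnormal(double x) {
        return x != 0.0 && Math.abs(x) < Double.MIN_NORMAL;
    }

    public static void main(String[] args) {
        // 0.99^n drops below Double.MIN_NORMAL after roughly 70,000 steps,
        // so 200,000 steps leaves plenty of time to settle.
        double x = decay(1.0, 0.99, 200_000);
        System.out.println("value = " + x + ", subnormal = " + isSubnormal(x));
    }
}
```

Whether the value sticks depends on the factor: a steep factor such as 0.1 still spends some iterations in the subnormal range but does eventually round to zero, while gentler factors like the one above never get there.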

The only way to get around this in languages like Java, which cannot flush to zero, is to vigilantly check the values and flush them to zero manually.
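
A manual flush of this sort might look like the sketch below. It is not Flow's actual code: the class, the helper names, and the choice to flush after every step are assumptions for illustration. It relies only on `Double.MIN_NORMAL` and `Float.MIN_NORMAL` from the standard library.

```java
// Illustrative sketch only; DenormalFlush and its methods are hypothetical.
final class DenormalFlush {
    /** Returns 0.0 for subnormal (or zero) inputs; passes everything else, including NaN, through. */
    static double flush(double x) {
        return Math.abs(x) < Double.MIN_NORMAL ? 0.0 : x;
    }

    /** Single-precision variant of the same check. */
    static float flush(float x) {
        return Math.abs(x) < Float.MIN_NORMAL ? 0.0f : x;
    }

    /** Decays x by factor each step, flushing subnormal results to zero as it goes. */
    static double decayWithFlush(double x, double factor, int steps) {
        for (int i = 0; i < steps; i++) {
            x = flush(x * factor);
        }
        return x;
    }

    public static void main(String[] args) {
        // Without the flush, this decay would settle on a subnormal value instead of zero.
        System.out.println(decayWithFlush(1.0, 0.99, 200_000));
    }
}
```

In a real synthesizer the check would more likely sit at a few strategic points, such as once per buffer on each piece of feedback state, rather than after every multiply. Note that this helper also turns a negative subnormal into positive zero.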

spacechild1 1683 days ago

Yes. It is very easy to accidentally produce denormals in recursive audio algorithms.

jcelerier 1683 days ago

> unless your code consists entirely of back-to-back operations with denormals and of nothing else.

Ending with the data being entirely in the denormal range is a common occurrence in some audio algorithms (and in there, Intel CPUs dominate by such a large margin it's not even funny); if that happens at the beginning of your signal processing pipeline you're in for a rough time.

adrian_b 1683 days ago

I agree that this happens, but solving such cases with FTZ is a lazy solution, which is guaranteed to give bad results, due to the loss of precision.

Even when 32-bit floating point is used, for a greater dynamic range and for a 24-bit precision, instead of using 16-bit fixed-point numbers, proper DSP algorithm implementations still need to use some of the techniques that are necessary with fixed-point number algorithms, i.e. suitable scale factors must be inserted in various places.

A correct implementation must avoid almost all underflows and overflows, by appropriate scalings.

jcelerier 1683 days ago

The problem is (speaking from the end user side), you can't guarantee that every plug-in you are going to use is going to be coded properly - and you don't want that 2007 plug-in whose author has been dead for a decade but is super important for your sound to bring your whole performance down when it gets silence-ish input.

titzer 1683 days ago

Complain to Intel. AMD and ARM chips have no such 100x penalties.

SeanLuke 1683 days ago

Perhaps true. But the point is: you're calling denormal failures "edge cases", yet my primary experience with denormals is exactly this.

titzer 1683 days ago

I sympathize. Software/hardware is littered with one person's edge cases being another person's entire world. But in the grand scheme, yes, subnormals are exceedingly rare. Clearly Intel microarchitecture designers think that, as they seem perfectly willing to continue punishing some applications with a massive performance cliff. Their mitigation should never have been "we'll add a cheat switch for speed" but rather "we'll work as hard as our competitors do to make these cases fast." Standards are supposed to do that, but cheaters abound (and yes I am being a bit pejorative--cheaters don't think of themselves as cheating, they merely have important use cases that demand special dispensation).

GPU hardware is a different, but similar story, from what I can see. It saves transistors to do FTZ, and the originally niche usage of FP to put pixels on the screen didn't really care so much about niggling details. But GPUs became general purpose and important, and they've been dragged into full compliance by application demands. It's the only sane outcome in the end. Instead, all this FTZ stuff has just made a mess at layers above. It would all be unnecessary if subnormals were as fast as AMD, ARM, IBM, and other chip manufacturers have managed to make them.

simonbyrne 1682 days ago

That's a fascinating thread, thanks: https://github.com/WebAssembly/design/issues/148
