by touisteur 1356 days ago
Tell me about non-guaranteed order of operations in GPU reductions and floating point results changing slightly between two runs. Yes it's useful and you get the goddamn FP32 TFLOPS, but damn it makes testing, validating, qualifying systems harder. And yes, I know one shouldn't rely on or test for equality, but not knowing the actual order of FP operations makes numerical analysis of the actual error harder (just take the worst case of every reduction, ugh).

EDIT: and don't get me started on tensor cores and clever tricks to have them do 'fp32-alike' accuracy. Yes, wonderful magic but how do you reason about these new objects without a whole new slew of tools.


There is nothing wrong with testing for equality so long as your computation is exact or correctly-rounded.
There is if the algorithm contains race conditions that cause non-deterministic output. The submitted article goes above and beyond to guarantee that the code always outputs the same answer even though it has race conditions. But that's sometimes not possible or, if it's possible, it's too much of a hassle so it's rarely done.

For example, this project https://github.com/EmbarkStudios/texture-synthesis generates textures, and if you run the same code with the same input several times, the results will be slightly different. Here https://github.com/EmbarkStudios/texture-synthesis#notes it says: "When using multiple threads for generation, the output image is not guaranteed to be deterministic with the same inputs. To have 100% determinism, you must use a thread count of one"

I gave conditions upon which testing for equality is correct.

Of course if the result is non-deterministic it doesn't satisfy those conditions.
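
A minimal sketch of the condition stated above, assuming Python and its standard `math.fsum` (which returns a correctly rounded sum of floats): a correctly rounded reduction gives one answer for every ordering, so an equality test is safe, while a plain left-to-right loop does not. The input values and the helper name `naive_sum` are illustrative choices, not something from the thread.

```python
import itertools
import math


def naive_sum(values):
    """Left-to-right accumulation; rounds after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


# Illustrative input: the large terms cancel, the small ones should add to 2.0.
values = [1e16, 1.0, -1e16, 1.0]

naive_results = {naive_sum(p) for p in itertools.permutations(values)}
exact_results = {math.fsum(p) for p in itertools.permutations(values)}

print("naive:", sorted(naive_results))  # more than one distinct answer
print("fsum: ", sorted(exact_results))  # a single answer for every ordering
```

The explicit loop is deliberate: the built-in `sum` may apply compensated summation to floats in recent Python versions, which would hide the effect being shown.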

Unfortunately it may be quite hard to certify that your program qualifies, especially if it's a high performance program that can't be single threaded. An innocuous-looking commit can completely undermine this property.

Doubly so if you must guarantee determinism across multiple platforms! IEEE 754-2008 helps but it defines cross-platform determinism for just a subset of operations. Compilers can also sometimes botch your FP code (there's a number of gotchas - for example, if any library anywhere in your program uses -ffast-math, it may infect the whole program https://stackoverflow.com/questions/68938175/what-happens-if...)

Achieving exact or correctly-rounded results is much more work than that.

You can look at the crlibm papers for example.

Yes, I know. For purely sequential code that's the actual use (though sometimes golden tests are generated through MATLAB or Python in double precision, and then every divergence becomes a game of whack-a-mole). And don't get me started on x87 80-bit extended precision suddenly compiled to SSE, so actual IEEE 754... We have integrated some of the FP static and dynamic analysis tools in our CI/CD pipeline for new code but ugh...

Anyway, as time passes by I veer off equality and think about the actual necessary accuracy and wish there was a way to set it as a spec for proof (SPARK/Ada or a higher-level DSL that can be lowered to proper accuracy analysis tools...).

I wish I could also specify 'no NaNs please' as a postcondition. Need to check in with the SPARK team and get an introduction article going...
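
As a rough illustration of the two wishes above (an accuracy budget instead of equality, and 'no NaNs' as a postcondition), here is a sketch in Python. It is only a runtime check, not a static proof of the kind SPARK provides. The names `within_ulps`, `no_nans` and `center` are hypothetical helpers made up for this example, and `math.ulp` assumes Python 3.9 or later.

```python
import functools
import math


def within_ulps(actual, expected, max_ulps):
    """True if actual is within max_ulps units in the last place of expected."""
    return abs(actual - expected) <= max_ulps * math.ulp(expected)


def no_nans(func):
    """Runtime postcondition: the returned float(s) must not contain NaN."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        values = result if isinstance(result, (list, tuple)) else [result]
        if any(math.isnan(v) for v in values):
            raise AssertionError(f"{func.__name__} returned NaN")
        return result
    return wrapper


@no_nans
def center(xs):
    """Shift values so the maximum becomes 0.0 (inf - inf yields NaN)."""
    m = max(xs)
    return [x - m for x in xs]


# Accuracy stated as a budget instead of bitwise equality.
print(0.1 + 0.2 == 0.3)                         # False
print(within_ulps(0.1 + 0.2, 0.3, max_ulps=1))  # True

print(center([1.0, 3.0]))                       # [-2.0, 0.0]
try:
    center([1.0, math.inf])
except AssertionError as err:
    print("postcondition violated:", err)
```

A ULP budget scales with the magnitude of the expected value, which is usually closer to what a numerical spec means than a fixed absolute epsilon.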

There are simple tools that tell you how many of your floating-point digits are just propagated rounding errors.
I think some people have gotten the mistaken idea that floating point arithmetic is inherently somehow non-deterministic. It is of course entirely deterministic, and if you do the same FP operations in the same order you will get the same result.
I was thinking and talking specifically of GPUs and reductions (aka multiple core interacting) during which you don't always know the exact order of operations.

And also, tell that to the people that went from a compiler using x87 instructions to one using SSE instructions and got different results between two binaries built from the same code. Yes, the exact same sequence of FP instructions should always give the same results. And that's also supposing you're not loading some library that sets ugly fast-math flags (see the recent yak-shaving session by @moyix).

You get the same result if you run on the same architecture and with the same instructions (or if you run on machines that implement IEEE 754-2008 [*], which was the first standard that guaranteed cross-platform determinism for a subset of floating point operations, which means, no SIMD!! =/), and you don't have non-determinism introduced by thread interleaving and race conditions (unless you very carefully account for that, like the article submitted in this thread)

I wish we had a language that guaranteed that the results of a computation were deterministic while still properly enabling the use of all available hardware resources (so: using SIMD, all CPU cores, and also offloading some code to a GPU if available), even if it had some overhead. Doing this manually is ridiculously difficult if you want to write high performance software, especially if you use GPUs.

[*] See https://stackoverflow.com/questions/42181795/is-ieee-754-200... - the amazing Rapier physics engine https://rapier.rs/ leverages IEEE 754-2008 to have a cross-platform deterministic mode that will run physics exactly the same way on every supported platform https://rapier.rs/docs/user_guides/rust/determinism/ - but this means taking a huge performance hit: you can't use SIMD and you must run the physics on a single thread.

Typical technobabble from someone who doesn't really understand floating-point.

Floating-point is not associative. Reordering operations yields different results, so no compiler will do so, unless you specifically disable standards conformance.

The use of SIMD, which is just a type of instruction-level parallelism, has no effect on the result of floating-point operations, unless of course you reorder your operations so that they may be parallelized.
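
To make the reordering point concrete, here is a small sketch in plain Python (no real SIMD involved, that part is simulated). It compares a strictly sequential sum with a sum that keeps one partial accumulator per "lane", which is the usual shape of a vectorized reduction. The function names and input values are illustrative assumptions.

```python
def sequential_sum(values):
    """Add the values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes):
    """Simulated vectorized reduction: one partial sum per lane,
    elements assigned round-robin, partials combined at the end."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sequential_sum(partial)


# Illustrative input: 2**53 + 1.0 is not representable and rounds back to 2**53.
values = [2.0**53, 1.0, -(2.0**53), 1.0]

print(sequential_sum(values))     # 1.0
print(lane_sum(values, lanes=2))  # 2.0
print(lane_sum(values, lanes=1))  # 1.0, same order as the sequential loop
```

Each individual addition is still rounded exactly as IEEE 754 requires; only the grouping changed, and that alone is enough to change the result.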

What does affect the result of floating-point operations is when rounding happens and at what precision. If we're talking about C, the compiler is allowed to run intermediate operations with higher precision than that mandated by its type. This is merely so that it can use x87, which works in 80-bit extended precision by default (padded to 96 or 128 bits in memory), and only round when it spills to memory and needs to store a 64-bit or 32-bit value. Compilers have flags to disable that behaviour, and it doesn't apply when the SSE unit is used instead of x87. Using SSE for floating-point doesn't necessarily mean it's using SIMD; most of the instructions have scalar variants.

Another example is FMA, which might be substituted for any multiply+add operations.

In practice if your code breaks with this it just means it was incorrect in the first place.

The actual rules are very complicated. C allows greater precision for intermediate results but compilers are sometimes careful to stick to IEEE rounding. [1] contains a good general overview, and [2] talks about FMA in particular. And in [3] I've set up a Godbolt example to play with. By default -O3 gives you FMA, but -O or -O3 with -ffp-contract=off don't. So you absolutely can get different results depending on optimization levels.

[1]: https://randomascii.wordpress.com/2012/03/21/intermediate-fl...

[2]: https://kristerw.github.io/2021/11/09/fp-contract/

[3]: https://godbolt.org/z/eTz8o6b3P
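
For readers who want to see the FMA effect without a compiler, here is a sketch that emulates a fused multiply-add in Python by computing `a*b + c` exactly with `fractions.Fraction` and rounding once at the end. This emulation is an assumption standing in for a hardware FMA instruction (Python 3.13 and later also offer `math.fma`); the helper names and the input values are chosen for illustration only.

```python
from fractions import Fraction


def separate_multiply_add(a, b, c):
    """Multiply, round to double, then add and round again."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulated FMA: exact a*b + c, rounded to double only once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Illustrative values: the exact product is 1 - 2**-54, which is not a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print(separate_multiply_add(a, b, c))  # 0.0
print(fused_multiply_add(a, b, c))     # -5.551115123125783e-17
```

The separate version loses the low bits when the product is rounded to 1.0, while the fused version keeps them, so the same source expression can legitimately yield two different doubles depending on whether contraction is enabled.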

The rule is very simple, I'm not seeing anything in what you say suggesting that it isn't?
You do realize programming languages don't necessarily manipulate the architecture's native floating point operations, but are free to define any semantics they want? You know, like it could have number types that work like in math, e.g. symbolic math tools do exactly that.

Also, that kind of language is absolutely not warranted.