by owlbite 1051 days ago
Attempting to get bit-exact reproducible results across different hardware is a fool's errand (if you care in the least about performance).

The nature of the beast is that as soon as you change the order of arithmetic, you're going to get a different result. Optimized code will give you different results on different hardware because you need to optimize things differently for each target. Threading, memory alignment, and different versions of the library software are all likely to lead to different results even on the same machine, unless the authors of the library go out of their way to promise repeatability.
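
To make the order-of-arithmetic point concrete, here is a minimal Python sketch (my own illustration, not something from the comment above). It assumes ordinary IEEE-754 doubles, which is what Python floats are on mainstream platforms, and only regroups the same three additions:

```python
# Floating-point addition is not associative: regrouping changes the rounding.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

Any optimization that reorders a sum (vectorizing, unrolling, splitting across threads) is effectively choosing a different grouping like this.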

(If you want to get the same answer, run on a single thread, page-align everything you feed in, and never upgrade your system; alternatively, write a scalar loop in C, compile with -O0 and pray the compiler doesn't change the order of things on its next upgrade.)

3 comments

> Attempting to get bit-exact reproducible results across different hardware is a fool's errand (if you care in the least about performance).

We did it for Wasm, which follows IEEE-754 semantics exactly for 32-bit and 64-bit floats. (The only nondeterminism is the exact bit pattern you get for NaNs in some circumstances.) Rounding is 100% well-specified. And CPUs have done that for decades. Even vector ISAs have learned that non-IEEE results are not what software wants; all vector ISAs are converging on IEEE-754.
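
A rough way to see what "rounding is well-specified" buys you, sketched in Python rather than Wasm (the `bits` helper is a name I made up for illustration). Assuming the host follows IEEE-754 binary64 for a single add, the result bits are fully determined, and only NaNs should be compared by "is it a NaN" rather than by bit pattern:

```python
import math
import struct

def bits(x):
    """Big-endian hex of the IEEE-754 binary64 encoding of x."""
    return struct.pack(">d", x).hex()

# A single correctly rounded add has exactly one permitted result.
print(bits(0.1 + 0.2))  # 3fd3333333333334
print(bits(0.3))        # 3fd3333333333333

# NaN payload/sign bits may vary between platforms, so test with isnan
# instead of comparing bit patterns.
nan = float("inf") - float("inf")
print(math.isnan(nan))  # True
```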

> Optimized code is going to give you different results on different hardware due to the fact that you need to optimize things differently.

This is due to C/C++ (and to some extent Fortran) semantics. It is not hardware.

What do threads have to do with floating point precision?

Oh, it's entirely possible to get bit-reproducible results. Just not in a performance portable fashion.

Different microarchitectures (e.g. how many vector instructions of what size need to be in flight for full occupancy), different numbers of cores (see threading discussion below) and often even differently aligned memory (does it need to be repacked or not for best performance?) will all require a different order of operations to obtain maximum throughput, which means different (but equally valid) results.

For threading in particular, if you want to get the same bit-exact answer, you end up constraining yourself to a particular ordering on reduction operations. This in turn either outright prevents techniques such as work-stealing or forces a very prescriptive reduction tree that itself constrains parallelism.
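
Here is a small single-threaded Python simulation of that trade-off (a sketch only; `naive_sum`, `chunked_sum`, `fixed_block_sum` and the `completion_order` parameter are hypothetical names, and no real threads are involved). Splitting the input into one chunk per worker makes the grouping depend on the worker count, while a fixed block size with partials combined in index order gives the same bits no matter which block finishes first:

```python
def naive_sum(values):
    """Left-to-right sum using plain float adds (no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, workers):
    """One contiguous chunk per worker, so the grouping depends on `workers`."""
    size = -(-len(values) // workers)  # ceiling division
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


def fixed_block_sum(values, completion_order=None, block=2):
    """Fixed block size; partials are stored by block index, combined in index order."""
    starts = list(range(0, len(values), block))
    order = completion_order if completion_order is not None else range(len(starts))
    partials = [0.0] * len(starts)
    for idx in order:  # blocks may "finish" in any order
        s = starts[idx]
        partials[idx] = naive_sum(values[s:s + block])
    return naive_sum(partials)


data = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(data, 1), chunked_sum(data, 2))                    # 1.0 0.0
print(fixed_block_sum(data, [0, 1]), fixed_block_sum(data, [1, 0]))  # 0.0 0.0
```

The exact sum of `data` is 2.0, so neither grouping is "right" here; the point is only that the fixed scheme is repeatable, at the cost of dictating how work is split and combined.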

This is entirely driven by hardware and its impacts on performance of algorithms, and applies regardless of the language you're writing in if you want to obtain the best possible performance from a given chip.

I’m not the parent, but I imagine they’re referring to the fact that, e.g., some FFTs use different partitioning strategies in different threading environments, which breaks bit-perfect replication.

There’s also the weirdness that in C++ the floating point environment is thread-local, which can cause all sorts of chaos.

...or use fixed-point arithmetic. Which, if I understand correctly, is basically the go-to of modern multiplayer-enabled game engines.
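
A minimal sketch of why fixed-point sidesteps the ordering problem, assuming a 16-bit fractional scale (the `SCALE` value and the `to_fixed`/`from_fixed` helpers are made up for illustration, not taken from any particular engine). Once values are integers, addition is exact and associative, so any grouping gives the same result:

```python
SCALE = 1 << 16  # 16 fractional bits (an assumption for this sketch)

def to_fixed(x):
    """Convert a float to a scaled integer (rounding happens once, here)."""
    return round(x * SCALE)

def from_fixed(n):
    """Convert a scaled integer back to a float for display."""
    return n / SCALE

values = [0.1, 0.2, 0.3]
fixed = [to_fixed(v) for v in values]

forward = (fixed[0] + fixed[1]) + fixed[2]
backward = fixed[0] + (fixed[1] + fixed[2])
print(forward == backward)  # True
print(from_fixed(forward))
```

The price is a fixed range and resolution, and multiplication or division needs its own explicit rescaling and rounding rule.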

The only reason you would want bit-reproducibility is because you haven't done the numerical analysis and have no clue how many digits of your "answer" to trust.

As far as I know, two sectors claim they need it: finance and climate.

"Do you want a better answer?"

"No, I want the same wrong answer that I got last Tuesday."

Science/Mathematics can't fix this.

> The only reason you would want bit-reproducibility is because you haven't done the numerical analysis and have no clue how many digits of your "answer" to trust.

I can confidently say that this is not the only good reason. Other reasons include:

- You want to compare different runs by hashing outputs (e.g. to find the first computation step where they diverged). Very useful for debugging, and also useful to determine whether you accurately reproduced a result (e.g. a customer problem).

- If your program has a single floating point comparison, there is no such thing as "enough significant digits" - with reasonable assumptions about the distribution of "irreproducibility", your logic is now divergent (and your output will jump between different values) with a certain probability. At that point we're no longer talking numerical analysis, it's straight up "divergent results".
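
Both bullets can be sketched in a few lines of Python (an illustration under my own assumptions: the `digest` helper, the `THRESHOLD` value and the two "runs" are made up). Two runs that differ only in the last bit hash differently, and a single comparison against a threshold sends them down different branches:

```python
import hashlib
import struct

def digest(values):
    """SHA-256 over the raw IEEE-754 bytes of each float (little-endian doubles)."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(raw).hexdigest()

# Same three numbers, summed in a different order by two hypothetical runs.
run_a = [(0.1 + 0.2) + 0.3]
run_b = [0.1 + (0.2 + 0.3)]

# Hashing the outputs flags the divergence immediately.
print(digest(run_a) == digest(run_b))  # False

# A one-ulp difference is enough to flip a branch.
THRESHOLD = 0.6
print(run_a[0] > THRESHOLD, run_b[0] > THRESHOLD)  # True False
```

Hashing intermediate results step by step in the same way is one way to locate the first step where two runs diverge.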

There's also "cover your ass". At least I've heard tales of major aerospace companies keeping warehouses of old Sun hardware in case they need to demonstrate the simulations they ran back in the 90s were not fabricated...

I’ve yet to meet a customer that cares enough to pay for the necessary numerical analysis.