by titzer 1039 days ago
> Attempting to get bit-exact reproducible results across different hardware is a fool's errand (if you care in the least about performance).

We did it for Wasm, which follows IEEE-754 semantics exactly for 32-bit and 64-bit floats. (The only nondeterminism is the exact bit pattern you get for NaNs in some circumstances.) Rounding is 100% well-specified. And CPUs have done that for decades. Even vector ISAs have learned that non-IEEE results are not what software wants; all vector ISAs are converging on IEEE-754.
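To make the NaN caveat concrete, here is a minimal Python sketch (not Wasm itself; the helper name `double_from_bits` is made up for illustration) showing that several distinct 64-bit patterns all decode as NaN, which is why the exact bits can differ even when every other result is pinned down:

```python
import math
import struct

def double_from_bits(bits):
    """Reinterpret a 64-bit integer as an IEEE-754 double (hypothetical helper)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# Two quiet NaNs that differ only in their payload bits.
nan_a = double_from_bits(0x7FF8000000000000)
nan_b = double_from_bits(0x7FF8000000000001)

# Both values are NaN, so numeric code cannot tell them apart;
# only a bit-level comparison would notice the difference.
print(math.isnan(nan_a), math.isnan(nan_b))
```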

> Optimized code is going to give you different results on different hardware due to the fact that you need to optimize things differently.

This is due to C/C++ (and to some extent Fortran) semantics. It is not hardware.

What do threads have to do with floating point precision?

2 comments

Oh, it's entirely possible to get bit-reproducible results. Just not in a performance-portable fashion.

Different microarchitectures (e.g. how many vector instructions of what size need to be in flight for full occupancy), different core counts (see the threading discussion below), and often even differently aligned memory (does it need to be repacked for best performance?) will all require a different order of operations to obtain maximum throughput, which means different (but equally valid) results.
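A rough Python sketch of why the order matters: floating-point addition is not associative, so splitting a sum across a different number of "lanes" can change the answer. The function `lane_sum` and the input values are invented for illustration; the lanes only stand in for vector width, and no real SIMD is involved.

```python
def lane_sum(values, lanes):
    """Sum values with `lanes` interleaved accumulators, then combine them in order."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:  # explicit loop: the built-in sum() may compensate for rounding
        total += p
    return total

values = [1e16, 1.0, -1e16, 1.0]

# One accumulator: 1e16 + 1.0 rounds back to 1e16, so the first 1.0 is lost.
print(lane_sum(values, 1))  # 1.0
# Two accumulators: the large terms cancel in one lane, the small ones add in the other.
print(lane_sum(values, 2))  # 2.0
```

Both results come from correctly rounded IEEE-754 additions; only the grouping differs.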

For threading in particular, if you want to get the same bit-exact answer, you end up constraining yourself to a particular ordering on reduction operations. This in turn either outright prevents techniques such as work-stealing or forces a very prescriptive reduction tree that itself constrains parallelism.
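As a sketch of what such a prescriptive reduction tree can look like, the Python below fixes the chunk size and the pairwise combination order, so the result does not depend on how many workers run. The names `chunk_sum`, `tree_reduce`, `deterministic_sum` and the default `chunk_size=4` are assumptions made for this example, not anything from a particular library.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    """Sum one chunk left to right."""
    total = 0.0
    for v in chunk:
        total += v
    return total

def tree_reduce(parts):
    """Combine neighbouring partial sums pairwise, always in the same order."""
    parts = list(parts)
    while len(parts) > 1:
        nxt = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:
            nxt.append(parts[-1])  # odd element carries over to the next level
        parts = nxt
    return parts[0] if parts else 0.0

def deterministic_sum(values, workers, chunk_size=4):
    # Chunk boundaries depend only on chunk_size, never on the worker count.
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(chunk_sum, chunks))  # map() keeps input order
    return tree_reduce(partials)

values = [0.1 * i for i in range(37)]
print(deterministic_sum(values, 1) == deterministic_sum(values, 8))  # True
```

The price is exactly the constraint described above: the chunking and tree shape cannot adapt to the machine, and partial sums cannot be folded in whatever order the workers happen to finish.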

This is entirely driven by hardware and its impact on algorithm performance, and it applies regardless of the language you're writing in if you want to obtain the best possible performance from a given chip.

I’m not the parent, but I imagine they’re referring to, e.g., the way some FFTs use different partitioning strategies in different threading environments, which breaks bit-perfect replication.

There’s also the weirdness that in C++ the floating point environment is thread-local, which can cause all sorts of chaos.