
Oh, it's entirely possible to get bit-reproducible results. Just not in a performance-portable fashion.

Different microarchitectures (e.g. how many vector instructions of what size need to be in flight for full occupancy), different numbers of cores (see threading discussion below) and often even differently aligned memory (does it need to be repacked or not for best performance?) will all require a different order of operations to obtain maximum throughput. Because floating-point addition is not associative, a different order means different (but equally valid) results.
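To make the ordering point concrete, here is a minimal Python sketch. The `lane_sum` helper and the input values are made up for illustration: it imitates a vectorized sum by keeping one accumulator per "lane" and combining the lanes at the end, and the values are deliberately extreme so the effect is visible in the first digit rather than the last bit.

```python
def lane_sum(values, lanes):
    """Sum values using `lanes` strided accumulators, then combine them.

    This mimics how a vectorized loop of a given width reorders additions.
    `lanes` is a stand-in for the hardware vector width (an assumption).
    """
    acc = [0.0] * lanes
    for i, v in enumerate(values):
        acc[i % lanes] += v
    total = 0.0
    for a in acc:
        total += a
    return total

# Deliberately extreme values: 1e16 + 1.0 rounds back to 1e16 in float64.
values = [1e16, 1.0, -1e16, 1.0]

print(lane_sum(values, 1))  # 1.0  (strictly left to right)
print(lane_sum(values, 2))  # 2.0  (two lanes: the large terms cancel first)
```

Same inputs, same additions, different grouping, different answer. With real data the difference usually shows up only in the last few bits, but that is enough to break bit-exactness.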

For threading in particular, if you want to get the same bit-exact answer, you end up constraining yourself to a particular ordering of reduction operations. This in turn either outright prevents techniques such as work-stealing or forces a very prescriptive reduction tree that itself constrains parallelism.
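As a rough sketch of what that constraint looks like in practice (Python, with a hypothetical `CHUNK` size and helper names of my own choosing): fix the chunk boundaries independently of the worker count, let any worker compute any chunk, and then combine the partial sums in a fixed order. The combine here is a plain left-to-right pass, which is the simplest case of a prescribed reduction order.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # hypothetical fixed chunk size; must NOT depend on the worker count


def chunk_sum(chunk):
    """Sum one chunk strictly left to right."""
    total = 0.0
    for v in chunk:
        total += v
    return total


def reproducible_sum(values, workers):
    """Parallel sum whose result does not depend on `workers`."""
    chunks = [values[i:i + CHUNK] for i in range(0, len(values), CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which thread finished first.
        partials = list(pool.map(chunk_sum, chunks))
    total = 0.0
    for p in partials:  # fixed combine order
        total += p
    return total


data = [0.1 * i for i in range(1000)]
print(reproducible_sum(data, 1) == reproducible_sum(data, 8))  # True
```

The price is exactly the one described above: the chunking and the combine order are baked in, so the runtime cannot resize chunks to fit the machine or combine partials in whatever order they happen to complete.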

This is entirely driven by hardware and its impact on algorithm performance, and it applies regardless of the language you're writing in if you want to obtain the best possible performance from a given chip.