Hacker News new | ask | show | jobs
by dibanez 3925 days ago
I also work in HPC and have been doing a better job of this but still struggling in some areas. For example, the sin() and cos() implementations are different on some machines, and their only guarantee is being within some ulp of the right answer. So far haven't come up with a rounding idea that will make those values the same again.

This is worse for our codes because our mesh structure changes topology based on floating point comparisons, so error in values turns into different discrete structures.

Parallelism introduces more headaches, making some codes non-deterministic between runs on the same machine, although I've been able to fix this for the most part.

Its important to break down "determinism" into consistency across changes in something, whether it be hardware or time or partitioning.

1 comments

If your parallel reductions are consistent, you're cool in my book. That said, I've solved a lot of dynamic load-balancing behavior with 64-bit fixed point reductions and decisions. Maybe that will work for you?

Unless the timings themselves are non-deterministic, sigh. But then I'm far more sympathetic. I have not found this sort of thing to be the case for the most part and when I have, I've used a deterministic measure rather than timing (i.e. number of calculations as opposed to how long they take).

For even compiler revisions are sufficient to break FP32 associativity let along different transcendental approximations (we ought to fix that, no?).

Where I get angry is when people sloppily use FP32 for everything or FP64 because it's double(tm) and then insist determinism isn't possible. That isn't even science IMO.

I don't understand the obsession with bitwise reproducibility. It seems like if your algorithm is numerically stable and valid, you can just compare multiple inexact test runs with some margin of error. Even if you hire nothing but guru-level IEEE floating point experts, a focus on bitwise reproducibility will close off a lot of opportunities to parallelize the code.

A lot of machine learning tools like random forests and neural networks inherently inject randomness. Are we just going throw up our hands and say we only use classical deterministic algorithms run in single threaded mode, because we can't think of any way to compare multiple test runs except memcmp? Because that's what I'm hearing (maybe I'm missing something).

Moore's Law w/r to clock frequency ran out over a decade ago. And that sucks, because it's trivial to make single-threaded code reproducible.

But parallel code is how one continues to benefit from Moore's law w/r to multiple cores and ever-increasing SIMD width and SIMD units. And it happened at exactly the same time as the migration to mostly single-threaded weakly-typed languages began.

For if your results aren't reproducible, there's no way to detect if your code has a race condition that is reducing the efficacy of your methods.

For an application like molecular dynamics, it's important to conserve overall energy. Any such inconsistency is the equivalent of setting off tiny little hand grenades in the simulation. Inconsistent summation obscures this without a great deal of work to sample many independent simulations. Compare and contrast to running things twice to sniff this out in deterministic code.

For machine learning, it can amount to anything from a harmless implicit regularizer to the AI equivalent of Gary Busey taking the wheel and driving you straight to crazy town.

I speak from experience in both cases. And speaking from experience, as long as your reductions are associative, you'll be fine. That can be achieved with fixed point atomics, or if you don't have them, reduction buffers, or finally, a deterministic reduction algorithm if you have neither. I've used all of the above and they have at most cost 2-3% more than the non-deterministic alternatives (usually much less).

Finally, I inject randomness like crazy, but I do so in a reproducible manner. Confusing determinism and randomness reminds me of people who don't understand the difference between precision and accuracy (TLDR: precision is easy, accuracy is tough).

I've seen race conditions that only happen 1 time in 1000. Or race conditions that only happen when the machine is under load. I agree that it's frustrating to have to debug non-deterministic code, but I feel like there must be a better way to compare test runs than memcmp on the resutls.

You are right that deterministic pseudo-random numbers can be used when "random" inputs are required. I overlooked that... thanks for pointing it out.

One way I addressed this exact situation: Compute a hash on the state after each iteration and save it. Now run again normally and save several iterations worth of state somewhere. When the hash diverges, you have hit the race condition.

Do this a few times until you've characterized what's happening and your brain figures out how and why by seeing the pattern this will (hopefully) reveal.

Dunno from obsession, but I like having a regression suite available, and bitwise reproducible means you can run the test suite as if it were a "diff", and no output means it passed.