Hacker News
by tastroder 2603 days ago
Glancing through the paper it seems like they use the recent Transformer model. Does whatever underlying stack they use expose something to share RNG seeds and the exact hardware optimizations your environment applies during training? Otherwise "publishing the seed" sounds nice but might not be as trivial as the phrase suggests.
5 comments

reproducibility should be something that's baked into an experiment's design.

so, if their experiment was designed such that reproduction is inherently difficult, they should have designed it in a better way, and they should've used a toolset that wouldn't run into that problem.

a non-reproducible experiment isn't necessarily completely without value, but it's a thing that everyone should look askance at till it proves its worth.

(apologies if my comments don't apply to this experiment and if it is reproducible -- i didn't have time to read through the OP, but i thought this reply was still a worthwhile response to its specific parent comment)

No, that's absolutely a fair and true point; my comment was aimed more at the RNG aspect. I have not looked into this specific one either, but normally people would hopefully not publish their best randomly achieved run if the system cannot reproduce it, or at least reproduce similar results.

That being said, the paper in question doesn't seem to reference open source code anyway, so I guess my point was kind of moot, apologies.

For the most part, yes.

There are specific CUDA operations which are not guaranteed to be reproducible though, as well as some cuDNN operations which can only be made deterministic at a performance cost, and this does cause real problems.

See https://pytorch.org/docs/stable/notes/randomness.html for some reasonable docs on this.

There are many CS conferences where you can/should submit a VM image to reproduce the results. See, e.g.: http://cavconference.org/2018/artifact-submission-and-evalua...
You want to be able to set the seed, if only so that you can debug your program. Pseudo-random numbers are sufficient for these models, and a seeded generator is independent of any hardware settings. You should not share your random source between concurrent threads, though, but that’s good practice anyway.
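To make the "don't share your random source between threads" part concrete, here is a minimal sketch using only Python's standard library. The names `BASE_SEED`, `worker` and `run` are made up for illustration, and deriving each worker's seed as `BASE_SEED + worker_id` is just one simple convention, not something from the paper or any particular framework.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BASE_SEED = 1234  # hypothetical value; any published integer works


def worker(worker_id):
    # Each worker owns a private generator seeded from the base seed,
    # so its draws do not depend on how the threads are scheduled.
    rng = random.Random(BASE_SEED + worker_id)
    return [rng.random() for _ in range(3)]


def run():
    # pool.map returns results in input order, regardless of which
    # thread finishes first.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(worker, range(4)))


first = run()
second = run()
print(first == second)
```

If the workers instead pulled from one shared generator, which worker got which number would depend on thread timing, and two runs with the same seed could diverge.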
Most machine learning accelerators have a few non-deterministic operations. The chances that you could run trillions of floating point operations through a GPU and get a bit-for-bit identical result are low.
Really? I'm not an ML guy so in simple terms, what are these non-deterministic ops? Or are you saying GPUs can be expected to be, basically, faulty?
Both.

Some operations split and join data in non-deterministic ways (especially the order of operations, leading to different floating point rounding). If you shard across multiple machines, weight accumulation order will depend on network latency for example.
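A tiny sketch of the rounding effect, in plain Python with IEEE 754 doubles. The values are chosen by hand to make the difference obvious, and `sum_in_order` is a hypothetical helper standing in for "whatever order the partial results happen to arrive in"; it is not how any real framework accumulates weights.

```python
def sum_in_order(values, order):
    # Accumulate left to right in the given order, the way a reduction
    # would if partial results arrived in that sequence.
    total = 0.0
    for index in order:
        total += values[index]
    return total


values = [1e16, 1.0, -1e16, 1.0]

# 1e16 + 1.0 rounds back to 1e16, so the first 1.0 is lost.
forward = sum_in_order(values, [0, 1, 2, 3])

# Cancelling the large terms first keeps both small terms.
reordered = sum_in_order(values, [0, 2, 1, 3])

print(forward, reordered)  # 1.0 2.0
```

Same inputs, same arithmetic, different answer purely because of ordering. Real training runs see far smaller discrepancies per operation, but they compound over many steps.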

Also, GPUs aren't anywhere near as reliable as CPUs when it comes to being able to run for hours without any random bit flips/errors.

> ...split and join data in non-deterministic ways ... to different floating point rounding

Ah, of course! A very timely reminder, thanks!

> GPUs aren't anywhere near as reliable as CPUs when it comes to being able to run for hours without any random bit flips/errors.

Now that's worrying. A bit flip can't be expected to be skewed towards any particular bit within a float, so it could easily happen in the exponent, skewing a single value by orders of magnitude one way or the other. Combine that with the rest of your 'good' results and yuck. That's very concerning. Thanks for the warning.
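For a feel of the magnitudes involved, here is a small sketch that flips single bits in the float32 encoding of 1.0 using Python's `struct` module. The helper `flip_bit` is made up for illustration, and it only simulates a flip in a stored value; it says nothing about how likely each bit is to flip on real hardware.

```python
import struct


def flip_bit(value, bit):
    # Reinterpret value as a 32-bit IEEE 754 float, flip one bit
    # (0 = least significant), and convert back.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped


# float32 layout: bit 31 sign, bits 30..23 exponent, bits 22..0 mantissa.
mantissa_flip = flip_bit(1.0, 0)        # lowest mantissa bit
low_exponent_flip = flip_bit(1.0, 23)   # lowest exponent bit
high_exponent_flip = flip_bit(1.0, 29)  # a high exponent bit

print(mantissa_flip, low_exponent_flip, high_exponent_flip)
```

The mantissa flip moves 1.0 by about one part in ten million, the low exponent flip halves it, and the high exponent flip turns it into 2**-64.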

> A bit flip can't be expected to be skewed towards any particular bit within a float,

Actually, I think they are - for example, the exponent path through an adder/multiplier is typically shorter, so when operated close to clock speed limits, the exponent is more likely to be correct.

(I've not actually verified the above on real hardware)