Hacker News
by _delirium 1889 days ago
NumPy does auto-seed the RNG if you don't pass a seed yourself, using platform-specific code to pull some entropy from the OS. So that common case is handled reasonably well, unlike with C. In fact, if you want exactly reproducible results (e.g. in test cases), you have to seed with a known seed to avoid that default behavior.

The issue here is a little more subtle: if you fork 10 copies of your Python process, all 10 inherit the current RNG state, and will thereafter produce identical random number sequences. If you were manually forking, you might guess that was a potential problem, and re-seed the RNGs after forking. But PyTorch's data loaders fork a bunch of processes to do things in parallel, so users might not realize that they're using duplicate copies of their RNG state.
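To make the inheritance concrete, here is a rough sketch using only the standard library. It assumes a POSIX system where the "fork" start method is available, and it uses a private `random.Random()` instance as a stand-in for NumPy's global RNG (recent CPython versions reseed the module-level `random` functions in forked children, which would hide the effect). The helper names `child` and `draws_from_forked_children` are made up for illustration.

```python
import multiprocessing as mp
import random

# Stand-in for a library-level RNG such as NumPy's global one: a Mersenne
# Twister instance seeded from OS entropy because no seed is passed.
rng = random.Random()


def child(queue, reseed):
    """Runs in a forked child: optionally reseed, then report three draws."""
    if reseed:
        rng.seed()  # no argument: pull fresh entropy from the OS
    queue.put([rng.randrange(1_000_000) for _ in range(3)])


def draws_from_forked_children(n_children=4, reseed=False):
    """Fork n_children processes and collect each one's first three draws."""
    ctx = mp.get_context("fork")  # POSIX only; not available on Windows
    queue = ctx.Queue()
    procs = [ctx.Process(target=child, args=(queue, reseed))
             for _ in range(n_children)]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]  # drain the queue before joining
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    # Every child inherited the same state, so every list is the same.
    print(draws_from_forked_children())
    # Reseeding inside each child makes the lists diverge.
    print(draws_from_forked_children(reseed=True))
```

The parent never draws from `rng` between forks, so each child starts from an identical copy of the state. Reseeding in the child is the manual fix mentioned above.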

It’s even slightly more subtle than that.

Python multiprocessing doesn’t use fork on Windows. It starts a fresh interpreter process, which initializes its own RNG state instead of inheriting the parent's, and so shouldn’t be affected by this.

So to trigger this you need to have num_workers != 0 on your DataLoader and be running on a non-Windows platform.

I get the desire to be pedantic, but does anyone at all train DL models on Windows (barring toy projects for fun and perhaps debugging)? The same can be said about num_workers > 0. You _have to_ fork worker processes unless you train something super tiny like MNIST and load the whole dataset onto the GPU.

> does anyone at all train DL models on Windows?

Yes. My last job was at a financial shop that was all Windows. They were doing ML with Python on Windows. Azure has boxes available for this.

Starting with Python 3.8, multiprocessing also defaults to the spawn start method on macOS (because some system libraries are not fork-safe).

IMHO cross-platform Python projects should call `multiprocessing.set_start_method('spawn')` to get the same behavior everywhere.
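A minimal sketch of what that could look like. `use_spawn_everywhere` is a hypothetical helper name, and `force=True` is my addition so the call doesn't raise if a start method was already chosen elsewhere in the program.

```python
import multiprocessing as mp


def use_spawn_everywhere():
    """Select 'spawn' so Linux, macOS and Windows all start workers the same way."""
    # Without force=True, a second call to set_start_method raises RuntimeError.
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


if __name__ == "__main__":
    # Call this once, early, before any pools or worker processes are created.
    print(use_spawn_everywhere())
```

One thing to keep in mind: under spawn the children re-import the main module, so whatever creates processes needs to sit behind an `if __name__ == "__main__":` guard, as the multiprocessing docs recommend.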