by shoyer 1888 days ago
This post is yet another example of why you should never use APIs for random number generation that rely upon and mutate hidden global state, like the functions in numpy.random. Instead, use APIs that explicitly deal with RNG state, e.g., by calling methods on an explicitly created numpy.random.Generator object. JAX takes this one step further: there are no mutable RNG objects at all, and the user has to explicitly manipulate RNG state with pure functions.

It’s a little annoying to have to set and pass RNG state explicitly, but on the plus side you never hit these sorts of issues. Your code will also be completely reproducible, without any chance of spooky “action at a distance.” Once you’ve been burned by this a few times, you’ll never go back.

You might think that explicitly seeding the global RNG would solve reproducibility issues, but it really doesn’t. If you call into any code you didn’t write, it might also be using the same global RNG.
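A minimal sketch of the difference, assuming NumPy is installed. `third_party_helper` is a hypothetical stand-in for code you didn't write that quietly draws from the global RNG; the seeds and the other function names are arbitrary choices for illustration.

```python
import numpy as np

def third_party_helper():
    # Hypothetical library code: silently advances the hidden global RNG.
    return np.random.rand()

def sample_global(call_helper):
    # Seed the global RNG, optionally call the helper, then draw one number.
    np.random.seed(0)
    if call_helper:
        third_party_helper()
    return np.random.rand()

def sample_explicit(call_helper):
    # Same flow, but our draw comes from a Generator that only we hold.
    rng = np.random.default_rng(0)
    if call_helper:
        third_party_helper()
    return rng.random()

# With global state, the helper call changes our "seeded" result.
global_changed = sample_global(False) != sample_global(True)
# With an explicit Generator, the helper cannot touch our stream.
explicit_changed = sample_explicit(False) != sample_explicit(True)
print(global_changed, explicit_changed)
```

Seeding the global RNG only pins down the start of the stream; anything else that draws from it in between shifts every number you get afterwards.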


The solution you suggest is irrelevant to the issue mentioned in the article. Even if you use np.random.RandomState, or any other "explicit RNG state", that state will still be copied in the fork() call.

The post just stresses that one should be careful when combining random states and multiprocessing, so you should either reseed after forking or use a multiprocess/multithread-aware RNG API.
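As a rough illustration of both halves of that point, here is a sketch that avoids a real fork() so it runs anywhere. `clone_state` is a hypothetical helper that imitates what fork() does to an RNG (the child gets an exact copy of the parent's state), and `SeedSequence.spawn` is one NumPy facility for handing each worker its own stream; the seed and worker count are arbitrary.

```python
import numpy as np

def clone_state(rng):
    # Hypothetical helper: copy the bit generator state, imitating the
    # memory copy a child process receives from fork().
    clone = np.random.default_rng()
    clone.bit_generator.state = rng.bit_generator.state
    return clone

parent = np.random.default_rng(123)

# Two "forked" workers that inherit the parent's state draw the same numbers,
# even though the RNG was created and passed around explicitly.
worker_a = clone_state(parent)
worker_b = clone_state(parent)
same_draws = np.array_equal(worker_a.random(3), worker_b.random(3))

# Giving each worker its own child seed yields independent streams.
child_seeds = np.random.SeedSequence(123).spawn(2)
rng_a, rng_b = (np.random.default_rng(s) for s in child_seeds)
distinct_draws = not np.array_equal(rng_a.random(3), rng_b.random(3))

print(same_draws, distinct_draws)
```

In a real program the child seeds would be created in the parent and passed to each worker as an argument, so the split happens before any process starts drawing numbers.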

I believe the point is that the error will be more obvious if the state is passed around explicitly.
Possibly, but this is the kind of boilerplate which people tend to ignore, especially when a program is non-trivial. It’s really easy to notice if you’re doing something like `seed_rng(); fork();` but once there’s distance and more than one thing being passed around, I’d be surprised if you didn’t find the same pattern, perhaps a bit less often.

Fundamentally, there are two problems: fork() is a performance trick for doing setup only once, and seeding an RNG is a type of setup where it isn’t intuitively obvious that it can’t be optimized that way; and if most people learn from a tutorial or quick start, this is exactly the kind of important but non-core issue that gets omitted or ignored in that context.

Additionally, I think people make a hidden assumption that they don't even realize they're making: that when you ask for random numbers from numpy, they're more or less "true" random numbers, not seeded ones. Like, I think the intention of the programmer is just "give me a bunch of random numbers, I don't really care how as long as they're random", and they assume that is what that numpy function does. But it doesn't: it provides you a pseudo-random sequence, not true randomness, so of course the sequence is identical after the fork.

Like, they think they're reading from /dev/random, but they're not: they're just running rand() (metaphorically speaking).
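To make the rand() versus /dev/random contrast concrete, here is a small sketch, assuming NumPy is installed. The seed 2021 and the variable names are arbitrary. Note that an unseeded `default_rng()` only pulls entropy from the OS once, when it is created; after that it is still a deterministic pseudo-random stream.

```python
import numpy as np

# Same seed, same "random" numbers: the sequence is fully determined by the state.
seeded_a = np.random.default_rng(2021).integers(0, 2**32, size=4)
seeded_b = np.random.default_rng(2021).integers(0, 2**32, size=4)
repeats = np.array_equal(seeded_a, seeded_b)

# No seed: each Generator is initialized from fresh OS entropy, so two of them
# are expected to disagree (a collision is astronomically unlikely).
fresh_a = np.random.default_rng().integers(0, 2**32, size=4)
fresh_b = np.random.default_rng().integers(0, 2**32, size=4)
differs = not np.array_equal(fresh_a, fresh_b)

print(repeats, differs)
```

A forked child is in the first situation, not the second: it starts from the parent's exact state, so it replays the same sequence unless something reseeds it.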

Definitely - back when I supported a computational neuroscience group that came up multiple times (not numpy but similar contexts), along with the various quirks around floating point math. Even experienced people do things like that because they’re focused on the actual problem and this is a leaky implementation detail.