|
I recently have been doing--what should be--straightforward subprocess work in Python, and the experience is infuriatingly bad. There are so many options for launching subprocesses and communicating with them, and each one has different caveats and undocumented limitations, especially around edge cases like processes crashing, timing out, killing them, if they are stuck in native code outside of the VM, etc. For example, some high-level options include Popen, multiprocessing.Process, multiprocessing.Pool, futures.ProcessPoolExecutor, and huge frameworks like Ray. multiprocessing.Process includes some pickling magic and you can pick from multiprocessing.Pipe and multiprocessing.Queue, but you need to use either multiprocessing.connection.wait() or select.select() to read the process sentinel simultaneously in case the process crashes. Which one? Well connection.wait() will not be interrupted by an OS signal. It's unclear why I would ever use connection.wait() then, is there some tradeoff I don't know about? For my use cases, process reuse would have been nice to be able to reuse network connections and such (useful even for a single process). Then you're looking at either multiprocessing.Pool or futures.ProcessPoolExecutor. They're very similar, except some bug fixes have gone into futures.ProcessPoolExecutor but not multiprocessing.Pool because...??? For example, if your subprocess exits uncleanly, multiprocessing.Pool will just hang, whereas futures.ProcessPoolExecutor will raise a BrokenProcessPool and the pool will refuse to do any more work (both of these are unreasonable behaviors IMO). Timing out and forcibly killing the subprocess is its own adventure for each of these too. I don't care about a result anymore after some time period passes, and they may be stuck in C code so I just want to whack the process and move on, but that is not very trivial with these. What a nightmarish mess! So much for "There should be one--and preferably only one--obvious way to do it"...my God. (I probably got some details wrong in the above rant, because there are so many to keep track of...) My learning: there is no "easy way to [process] parallelism" in Python. There are many different ways to do it, and you need to know all the nuances of each and how they address your requirements to know whether you can reuse existing high-level impls or you need to write your own low-level impl. |
Process is low-level and is almost never what you want. Pool is "mid-level", and usually isn't what you want. ProcessPoolExecutor is usually what you want, it is the "one obvious way to do it". That's not at all clear from the docs though.
The one obvious way to do it, in general, is: subprocess.run for running external processes, subprocess.Popen for async interaction with external processes, and concurrent.futures.ProcessPoolExecutor for Python multiprocessing.
Your other complaints about actually using the multiprocessing stuff are 100% valid. Error handling, cancellation, etc. is all very difficult. Passing data back and forth between the main process and subprocesses is not trivial.
But I do want to emphasize that there is a somewhat-well-defined gradient of lower- and higher-level tools in the standard library, and your "obvious way to do it" should usually start at the higher end of that gradient.
You might also want to look into the third-party Joblib library, which makes process parallelism a lot less painful for the straightforward use case of "run a function on a large amount of data, using multiple OS processes."