I think a lot of this complexity can be avoided by just writing single-threaded Python and using GNU parallel to run it on multiple cores. You can even trivially distribute the work across a cluster that way.
This is the approach I've taken, albeit at the "top level" of the program. Since I know I don't have to deal with Windows, I much prefer piping to parallel (rather than xargs), calling make -j8, or otherwise letting some shell wrapper handle it, over dealing with the overhead inside Python, especially multiprocessing.
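To make that concrete, here is a rough sketch of what I mean by keeping the Python side single-threaded. The script name (wc_one.py) and the word-count task are made up for illustration; the only point is that the script handles the files it is given and knows nothing about parallelism.

```python
import sys


def count_words(path):
    """Count whitespace-separated words in one file, single-threaded."""
    with open(path, encoding="utf-8") as f:
        return sum(len(line.split()) for line in f)


def main(argv):
    # Process whatever paths were passed on the command line, one by one.
    # Fanning out across cores is left entirely to the shell, e.g.
    # (hypothetical file names):
    #   ls *.txt | parallel python wc_one.py {}
    for path in argv[1:]:
        print(f"{path}\t{count_words(path)}")


if __name__ == "__main__":
    main(sys.argv)
```

The same script could just as well be driven by xargs -P or a Makefile rule, since nothing in it depends on how it gets launched.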
However, where I think having this stuff available inside of Python is useful is that it's cross-platform and consumable from "higher levels" of Python. A library can do some mucky stuff internally to speed up computation but still present a simple sync interface, all without external dependencies.
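As a minimal sketch of that last point, assuming a made-up library function called checksums: the caller gets an ordinary blocking call that returns a list, and the pool is an internal detail. I'm using ThreadPoolExecutor from the standard library here only to keep the example short; for CPU-bound pure-Python work you would more likely swap in ProcessPoolExecutor, since threads alone won't get around the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def _checksum(chunk):
    # Toy per-item work (hypothetical): sum of the bytes, modulo 65521.
    return sum(chunk) % 65521


def checksums(chunks, max_workers=4):
    """Synchronous interface: blocks, then returns results in input order.

    The worker pool is created and torn down inside the call, so callers
    never see futures, pools, or any other concurrency machinery.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_checksum, chunks))
```

From the caller's side, checksums([b"abc", b"de"]) looks like any other plain function call, which is the whole appeal.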