Hacker News
by shepardrtc 2053 days ago
I've been building a program that heavily uses multiprocessing for the past few months. It works quite well, but it did take me a little bit to figure out the best way to work with it.

> - Threads can't be used efficiently because of the GIL

Python's "threads" are actually fibers. Once you shift your thinking toward that, it's easy enough to work with them. Async is a better solution, though, because "threads" aren't smart when switching between themselves. Async makes concurrency smart.
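One way to read "smart" here: an asyncio task only gives up control at an explicit `await`, whereas a thread can be switched out at arbitrary points. A minimal sketch of that cooperative switching (the names `worker` and `run_demo` are made up for illustration, not taken from the program described above):

```python
import asyncio

async def worker(name, steps, log):
    # Each task runs uninterrupted until it reaches an await.
    for i in range(steps):
        log.append((name, i))
        # Explicit yield point: the event loop may run another task here.
        await asyncio.sleep(0)

async def run_demo():
    log = []
    # Both workers share one thread; they interleave only at the awaits.
    await asyncio.gather(worker("a", 3, log), worker("b", 3, log))
    return log

log = asyncio.run(run_demo())
print(log)
```

Because switches only happen at `await`, the `log.append` calls need no lock, which is the main practical difference from doing the same thing with threads.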

But if you want to use real threads, multiprocessing's "processes" are actually system threads.

> - multiprocesses has to serialize everything in a single thread often killing performance. (Unless you use shared memory space techniques, but that's less than ideal compared to threads)

I'm not quite sure what you mean. Multiprocessing's processes have their own GIL and are "single-threaded", but you can still spawn fibers and more processes from them, as well as use async.

Or are you talking about using the Manager and namespaces to communicate between processes? That is a little slow, yes. High speed code should probably use something else. Most programs will be fine with it, but it is way slower than rolling your own solution. However, it does work easily, so there's something to be said for it. Shared memory space techniques do work, too, but they are a little obtuse. Personally, I rolled my own data structures using the multiprocessing primitives. You have to set them up ahead of time, but they're insanely fast. Or you can use Redis pubsub for IPC. Or write to a memory-mapped file.
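As a rough illustration of the memory-mapped file option, here is a sketch using only the standard library. The record layout, file name, and helper names are assumptions made for the example; the writer and reader below run in one process but open independent mappings, the way two separate processes would.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed layout: int64 sequence number + float64 value.
RECORD = struct.Struct("<qd")

def create_backing_file(path):
    # The file is sized ahead of time; mmap cannot map an empty file.
    with open(path, "wb") as f:
        f.write(b"\x00" * RECORD.size)

def write_record(path, seq, value):
    # Map the file and pack the record straight into the shared bytes.
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), RECORD.size) as mm:
        RECORD.pack_into(mm, 0, seq, value)
        mm.flush()

def read_record(path):
    # A separate mapping of the same file sees what the writer stored.
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), RECORD.size) as mm:
        return RECORD.unpack_from(mm, 0)

path = os.path.join(tempfile.mkdtemp(), "ipc.bin")
create_backing_file(path)
write_record(path, 1, 3.5)
print(read_record(path))
```

Note that this sketch has no synchronization. With real concurrent readers and writers you would still need a lock or some sequence-number protocol on top of it.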

> - You can't use multiprocess while inside a multiprocess executor. This makes building things on top of frameworks/libs that use multiprocess a nightmare... e.g try to use a web server like over something like Keras...

I'm not sure what you mean. Multiprocessing simply spawns other Python processes. You can spawn processes from processes, so I don't know why you would have issues. Perhaps communication is an issue?

> - The dependency ecosystem is a pita

Yes, absolutely.

2 comments

> Python's "threads" are actually fibers.

They’re actually not. They are native threads with high lock contention.

Async is arguably fibers, as are greenthreads in libraries like gevent or eventlet.
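A quick way to check the "native threads" part yourself, assuming Python 3.8+ where `threading.get_native_id()` is available: each `threading.Thread` reports its own kernel-assigned thread ID. The helper name `collect_native_ids` is made up for this sketch.

```python
import os
import threading

def collect_native_ids(n_threads):
    ids = []
    lock = threading.Lock()
    # The barrier keeps every thread alive until all have reported,
    # so the OS cannot hand a finished thread's ID to a later one.
    barrier = threading.Barrier(n_threads)

    def record():
        with lock:
            ids.append(threading.get_native_id())
        barrier.wait()

    threads = [threading.Thread(target=record) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ids

ids = collect_native_ids(4)
print("pid:", os.getpid(), "kernel thread ids:", ids)
```

All the IDs differ from each other and from the main thread's, while the process ID stays the same. Fibers or green threads multiplexed onto one OS thread would not show up this way.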

> But if you want to use real threads, multiprocessing's "processes" are actually system threads.

They’re system threads running in separate memory spaces. Also known as… processes.

You're right. To me they just feel like fibers because they can't run in parallel.
If you use numba (or cython, c extensions, etc) you can make them run without requiring that they hold the GIL, and they can run in parallel. Here's an example that should keep a CPU pegged at 100% utilization for a while:

  import numba as nb
  from concurrent.futures import ThreadPoolExecutor
  from multiprocessing import cpu_count

  # nogil=True lets the compiled function run without holding the GIL
  @nb.jit(nogil=True)
  def slow_calculation(x):
      out = 0
      for i in range(x):
          out += i**0.01
      return out

  # one thread per core, each running the compiled loop at the same time
  ex = ThreadPoolExecutor(max_workers=cpu_count())
  futures = [ex.submit(slow_calculation, 100_000_000_000+i) for i in range(cpu_count())]
> and they can run in parallel.

Even without requiring the GIL, these are still child threads of the main process, correct? And because of that, wouldn't the OS keep them all on the same core? And if that's the case, would ProcessPoolExecutor solve that problem?
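For what it's worth, the OS scheduler is free to place threads of one process on different cores; it is the GIL, not the parent/child relationship, that normally serializes Python threads. A sketch that avoids numba by relying on the documented behavior that `hashlib` releases the GIL while hashing buffers larger than 2047 bytes (the payload size and the cap of 4 workers are arbitrary choices for the example):

```python
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor

def digest(buf):
    # hashlib drops the GIL while hashing a large buffer.
    return hashlib.sha256(buf).hexdigest()

def timed(fn):
    # Run fn once and return (result, elapsed seconds).
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

n_workers = min(4, os.cpu_count() or 1)
payload = bytes(32 * 1024 * 1024)  # 32 MiB of zeros, shared by all tasks

# Same work done one after another, then spread across threads.
serial, t_serial = timed(lambda: [digest(payload) for _ in range(n_workers)])
with ThreadPoolExecutor(max_workers=n_workers) as ex:
    threaded, t_threaded = timed(
        lambda: list(ex.map(digest, [payload] * n_workers))
    )

print(f"serial {t_serial:.2f}s, threaded {t_threaded:.2f}s on {n_workers} threads")
```

On a multi-core machine the threaded run would usually be expected to finish noticeably faster, which suggests the threads really are running on several cores. ProcessPoolExecutor is the tool for code that does hold the GIL; it should not be needed just to get GIL-free threads onto separate cores.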

I had no idea that existed, thank you!
No, man. Python threads are not fibers. This is factually wrong. Please read: https://wiki.python.org/moin/GlobalInterpreterLock