by zbentley 2862 days ago
> Any resources on how to implement that?

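  import multiprocessing

  # Loaded before the pool starts. With the fork start method, workers
  # inherit this memory instead of receiving it through a pipe.
  big_data = read_huge_binary_or_string()

  def process_range(rng):
    start, end = rng
    do_something(big_data[start:end])

  pool = multiprocessing.Pool(2)
  pool.map(process_range, [
    (0, 10000),
    (10000, len(big_data)),  # slice ends are exclusive, so resume at 10000
  ])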

Also, after doing some research:

The `multiprocessing.Pool` uses a `multiprocessing.Queue` in the background to retrieve the results after completion.

The `multiprocessing.Queue` in turn uses `multiprocessing.connection.Pipe` and sends the pickled objects over the wire.
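To see what "pickled objects over the wire" looks like at the connection level, here is a minimal sketch. It keeps both ends of a pipe in one process purely for illustration, and it relies on CPython's current behavior that `send()` writes a pickle as a single message; the `result` payload is a made-up example.

```python
import multiprocessing
import pickle

# Both ends live in one process here only for illustration; normally the
# reader and writer belong to different processes.
reader, writer = multiprocessing.Pipe(duplex=False)

result = {"worker": 1, "values": [1, 2, 3]}  # hypothetical worker result
writer.send(result)        # send() pickles the object and writes the bytes

raw = reader.recv_bytes()  # read the message without unpickling it
roundtrip = pickle.loads(raw)

print(type(raw).__name__, len(raw))
print(roundtrip == result)
```

The bytes read back are an ordinary pickle, which is why anything placed on a queue or returned from a pool worker has to be picklable.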

So I don't see how this is any better than ZMQ.

Just because stuff has an API that doesn't look like message passing doesn't mean it can't be doing that in the background. Which is funny, because that's the whole point of ZProc.

I realize the subtle difference that CPython uses pipes, not sockets, unlike ZMQ. But that doesn't really make a difference now, does it?

Proof:

Process Pool worker, returning the result by using `outqueue.put()`

https://github.com/python/cpython/blob/86b89916d1b0a26c1e77f...

multiprocessing Queue, initializing a Pipe

https://github.com/python/cpython/blob/86b89916d1b0a26c1e77f...

multiprocessing Queue serializing data to send it using that Pipe

https://github.com/python/cpython/blob/86b89916d1b0a26c1e77f...

No pipes or queues are used to move `big_data` in the example code above. The workers get the large piece of data without serialization; only the small range tuples and the results go through the pool's queues.
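To make the distinction concrete, here is a minimal sketch, assuming a Unix system where the `fork` start method is available. `BIG_DATA`, `split_ranges`, and `checksum_range` are hypothetical names, not from the example above. Only the small `(start, end)` tuples and the integer results are pickled; the workers read `BIG_DATA` from memory inherited at fork time.

```python
import multiprocessing
import pickle

# Hypothetical stand-in for read_huge_binary_or_string(): created at import
# time, before the pool exists, so forked workers inherit it.
BIG_DATA = bytes(range(256)) * 4096  # 1 MiB


def split_ranges(total, parts):
    """Split [0, total) into `parts` half-open (start, end) ranges."""
    step, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        end = start + step + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def checksum_range(rng):
    """Runs in a worker; reads the inherited BIG_DATA, returns a small int."""
    start, end = rng
    return sum(BIG_DATA[start:end])


if __name__ == "__main__":
    ranges = split_ranges(len(BIG_DATA), 2)
    # Pickled size of one range tuple compared with the size of the data.
    print(len(pickle.dumps(ranges[0])), "bytes vs", len(BIG_DATA), "bytes")
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-only
    with ctx.Pool(2) as pool:
        print(sum(pool.map(checksum_range, ranges)) == sum(BIG_DATA))
```

With the `spawn` start method the module would be re-imported in each worker instead, so the data would be rebuilt or reloaded per process rather than inherited.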

The point of the original post is that MP lets you do more than just serialize/ship data around after pool start time; there are substantial optimizations you can do if you know lots of the data you need to process early on.

Right, that only concerns sending data at startup, which both Python and zproc already do.

I thought you were talking about sending data to child processes in constant* time while they were running.