by devxpy 2863 days ago
Performance doesn't equal better software.

In fact, I think performance-centric development is a lesser-known evil.

> have all your data before creating your processes/pool

ZProc exposes the required API for this (nothing new, just the Python API) :)

https://zproc.readthedocs.io/en/latest/api.html#zproc.Proces... (args and kwargs)

> a massive dataset

Wouldn't you be better off using a database for that kind of work?

> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant time

Any resources on how to implement that?

1 comments

> Any resources on how to implement that?

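  import multiprocessing

  # Load everything before the pool is created, so the workers inherit it.
  big_data = read_huge_binary_or_string()

  def process_range(rng):
    start, end = rng
    # With the fork start method, big_data is inherited, not pickled.
    do_something(big_data[start:end])

  pool = multiprocessing.Pool(2)
  pool.map(process_range, [
    (0, 10000),
    (10000, len(big_data)),
  ])

Also, after doing some research: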

The `multiprocessing.Pool` uses a `multiprocessing` queue (a `SimpleQueue` in current CPython) in the background to retrieve the results after completion.

That queue in turn uses `multiprocessing.connection.Pipe` and sends the pickled objects over the wire.
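To make the pickling step concrete, here is a minimal sketch using only a bare `multiprocessing.Pipe`, in a single process. `Traced`, `PICKLE_CALLS` and `roundtrip` are hypothetical names made up for this illustration, not part of CPython or ZProc; the sketch only shows that `send()` serializes the object before writing it to the pipe.

```python
import multiprocessing as mp

PICKLE_CALLS = []  # one entry is appended every time pickle serializes a Traced


class Traced:
    """Hypothetical wrapper that records when pickle serializes it."""

    def __init__(self, payload):
        self.payload = payload

    def __reduce__(self):
        PICKLE_CALLS.append(len(self.payload))
        # Tell pickle to rebuild the object as plain bytes on the other end.
        return (bytes, (self.payload,))


def roundtrip(obj):
    """Send obj through a one-way pipe and read it back in the same process."""
    reader, writer = mp.Pipe(duplex=False)
    try:
        writer.send(obj)      # pickles obj, then writes the bytes to the pipe
        return reader.recv()  # reads the bytes back and unpickles them
    finally:
        reader.close()
        writer.close()


if __name__ == "__main__":
    data = b"x" * 1024  # kept small so the write fits in the pipe buffer
    received = roundtrip(Traced(data))
    print(received == data, PICKLE_CALLS)
```

The payload is deliberately small: nothing reads from the pipe while `send()` is writing, so a payload larger than the OS pipe buffer could block in this single-process setup.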

So I don't see how this is any better than ZMQ.

Just because stuff has an API that doesn't look like message passing doesn't mean it can't be doing that in the background. Which is funny, because that's the whole point of ZProc.

I realize the subtle difference that CPython uses pipes, not sockets, unlike ZMQ. But that doesn't really make a difference now, does it?

Proof:

Process Pool worker, returning the result by using `outqueue.put()`

https://github.com/python/cpython/blob/86b89916d1b0a26c1e77f...

multiprocessing Queue, initializing a Pipe

https://github.com/python/cpython/blob/86b89916d1b0a26c1e77f...

multiprocessing Queue serializing data to send it using that Pipe

https://github.com/python/cpython/blob/86b89916d1b0a26c1e77f...

No pipes or queues are used as part of the example code above. It transfers the large piece of data without serialization.

The point of the original post is that MP lets you do more than just serialize/ship data around after pool start time; there are substantial optimizations you can do if you know lots of the data you need to process early on.
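One rough way to check the "no serialization" claim is to make the data unpicklable and see whether forked children can still read it. The sketch below assumes the `fork` start method (POSIX only) and uses bare `Process` objects rather than `Pool` to stay small; `BIG`, `worker`, `is_picklable` and `run_demo` are hypothetical names invented for the illustration.

```python
import multiprocessing as mp
import pickle

# Hypothetical stand-in for the huge dataset. The lambda makes the dict
# unpicklable, so it could never travel through a pipe or queue.
BIG = {"payload": bytes(range(256)) * 4096, "reduce": lambda chunk: sum(chunk)}


def is_picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def worker(conn, start, end):
    # BIG is read from memory inherited at fork time. Only the small integer
    # result is pickled and sent back through the pipe.
    conn.send(BIG["reduce"](BIG["payload"][start:end]))
    conn.close()


def run_demo():
    ctx = mp.get_context("fork")  # assumption: a platform that supports fork
    size = len(BIG["payload"])
    half = size // 2
    jobs = []
    for start, end in [(0, half), (half, size)]:
        parent_conn, child_conn = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=worker, args=(child_conn, start, end))
        proc.start()
        child_conn.close()  # the parent only needs the reading end
        jobs.append((proc, parent_conn))
    results = [conn.recv() for _, conn in jobs]
    for proc, _ in jobs:
        proc.join()
    return results


if __name__ == "__main__":
    print("picklable:", is_picklable(BIG))
    print("per-range sums:", run_demo())
```

The children still return their results over a pipe, which is the part the earlier comment was pointing at; what the fork avoids is shipping the input data the same way.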

Right, that only concerns sending data at startup, which both Python and ZProc already do.

I thought you were talking about sending data to child processes in constant time while they were running.