by Galanwe 776 days ago
Sharing objects is completely fine with a forking multiprocessing pool. Each worker process ends up with a (lazy) copy of the parent address space, including the whole Python interpreter state.

When you think about it, you already rely on that to access imported modules, global variables, etc. in your worker processes. The GIL and reference counting are not a concern either, as they are both copied from the parent process as well, so you can freely continue to read global state (and even modify it, though the changes will be local to the worker process).

That is, if you have large objects to share with your worker processes, and you don't need to get them back, you can simply assign them to global variables before starting your multiprocessing pool and have your workers read them, safely and at zero transfer cost. This _trick_ has been used for decades in Python.
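
A minimal sketch of that global variable trick, assuming a POSIX system where the "fork" start method is available. `BIG_TABLE`, `lookup` and `run` are made-up names for illustration; only the small integer arguments get pickled.

```python
import multiprocessing as mp

# Hypothetical stand-in for a large object that would be expensive to pickle.
BIG_TABLE = None


def lookup(i):
    # Reads the global inherited from the parent at fork time.
    return BIG_TABLE[i]


def run():
    global BIG_TABLE
    # Assign the global BEFORE the pool exists, so forked workers inherit it.
    BIG_TABLE = [n * n for n in range(100_000)]
    ctx = mp.get_context("fork")  # assumption: fork is supported on this platform
    with ctx.Pool(2) as pool:
        return pool.map(lookup, [1, 2, 3])


if __name__ == "__main__":
    print(run())
```

Asking for the "fork" context explicitly matters because the default start method is not fork on every platform or Python version.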

Now the annoying thing is that this global variable mumbo jumbo is kind of dirty and ad hoc to maintain. Using custom picklable weakrefs (something similar to the trick with id() in this repo), you can create weak proxy objects to transfer to your workers, achieving the same effect while keeping a "function argument" interface instead of proxying through global variables.

2 comments

The problem is, once you access such shared objects in Python, it is never readonly access but actually read-write, because it modifies the refcount. The problem is also described here: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multip...

But also, you say you would prefer such an unbound memory access hack instead of using a global variable?

But also, why does it need to be a global variable? When you fork(), afterwards all the local variables are available to the child process. No need for global variables.

> The problem is, once you access such shared objects in Python, it is never readonly access but actually read-write, because it modifies the refcount.

That is right, but it is a mere drop in the sea. First, because reference counting is not intrusive in CPython (meaning the reference counting structures are outside the PyObject, last I checked), so you will mainly copy-on-write these small external structures anyway. Second, what I'm describing here is for when pickling objects across workers is prohibitively slow and memory-consuming, which typically means sharing pandas dataframes of dozens or hundreds of gigabytes. Some copied refcount pages here and there are really not going to be the culprit.

> But also, why does it need to be a global variable? When you fork(), afterwards all the local variables are available to the child process. No need for global variables.

Right, but you need some way to access these variables, and once you're in a worker process you simply are in a different scope.

    from multiprocessing import Pool

    def workerfunc(x):
        # I'm a poor worker in an empty scope:
        # juicy_variable is not visible from here.
        ...

    def parent():
        juicy_variable = ...
        with Pool(42) as pool:
            result = pool.map(workerfunc, [1, 2, 3])

> reference counting is not intrusive in CPython (meaning the reference counting structures are outside the PyObject, last I checked)

That's wrong. That was never the case.

Recent CPython: https://github.com/python/cpython/blob/6d419db10c84cacbb3862...

CPython 2.0: https://github.com/python/cpython/blob/2a9b0a93091b9ef7350a9...

CPython 0.9.8: https://github.com/python/cpython/blob/dd104400dc551dd4098f3...
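
For what it's worth, this can also be checked from Python itself. The sketch below assumes a default (GIL) CPython build on a little-endian machine, where the reference count is the first field of the object header; `header_refcount` and `refcount_delta` are made-up helper names, and the layout is an implementation detail that differs on free-threaded builds.

```python
import ctypes


def header_refcount(addr):
    # Assumption: default (GIL) CPython build, little-endian machine. The
    # reference count sits at the very start of the object, and its low
    # 32 bits are read here as an unsigned integer.
    return ctypes.c_uint32.from_address(addr).value


def refcount_delta():
    obj = object()
    addr = id(obj)  # in CPython, id() is the object's memory address
    before = header_refcount(addr)
    holder = [obj]  # storing obj in a list takes one more reference
    after = header_refcount(addr)
    return after - before


if __name__ == "__main__":
    print(refcount_delta())
```

Taking a new reference changes bytes inside the object itself, which is why merely touching an inherited object in a forked child can dirty the memory page it lives on.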

Regarding multiprocessing.Pool, that would not work as I said. I was thinking more about a plain fork, like this:

    import os

    def parent():
        juicy_variable = ...

        def workerfunc(x):
            # I can access juicy_variable
            ...

        children = []
        for i in [1, 2, 3]:
            child = os.fork()
            if child == 0:
                workerfunc(i)
                os._exit(0)  # leave the child without running the parent's cleanup handlers
            children.append(child)

        # Wait for and clean up the children.
        for child in children:
            os.waitpid(child, 0)
        # Communicate somehow with the children to get back results.
        ...

>> reference counting is not intrusive in CPython (meaning the reference counting structures are outside the PyObject, last I checked)

> That's wrong. That was never the case.

You are right, I was mistaken.

The point stands though, PyObjects are not really an issue for use cases where these tricks are needed.

> I was thinking more about a plain fork, like this

Right, well, you can recreate a multiprocessing pool of your own with different pros and cons, sure; that's another approach, I guess.
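
For illustration, a hand-rolled version of that plain-fork approach could look roughly like the sketch below. It assumes a POSIX system; `fork_map` is a hypothetical helper, and sending results back through a pipe with pickle is just one possible choice for the "communicate somehow" part.

```python
import os
import pickle


def fork_map(func, items):
    """Run func(item) in one forked child per item and collect the results."""
    pending = []
    for item in items:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: compute, send the result back, then exit
            os.close(read_fd)
            status = 1
            try:
                with os.fdopen(write_fd, "wb") as out:
                    pickle.dump(func(item), out)
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's code
        os.close(write_fd)
        pending.append((pid, read_fd))

    results = []
    for pid, read_fd in pending:
        with os.fdopen(read_fd, "rb") as src:
            results.append(pickle.load(src))
        os.waitpid(pid, 0)  # reap the child
    return results


def parent():
    juicy_variable = list(range(10))

    def workerfunc(x):
        # A closure over a local variable: nothing here is pickled on the way in.
        return juicy_variable[x] * 2

    return fork_map(workerfunc, [1, 2, 3])


if __name__ == "__main__":
    print(parent())
```

Only the results travel through pickle here; the inputs, including the closure, are simply inherited by each child. Error handling is deliberately minimal: a failing child just makes `pickle.load` raise `EOFError` in the parent.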

Do you have any publicly available code demonstrating this pattern?

I don't actually, but it can be explained in a few lines of code. Consider the following two simple functions:

    import ctypes

    def ref(obj):
        return id(obj)

    def deref(addr):
        return ctypes.cast(addr, ctypes.py_object).value

Basically, this relies on an implementation detail of `id()` in CPython: the unique id of an object is its memory address. `ref()` returns a reference to an object (think `&` in C), and `deref()` dereferences it back (think `*` in C). This is close to the standard `weakref` module in essence, but weakref is a black box.
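
A quick single-process round trip, just to show that dereferencing gives back the very same object and not a copy. `payload` is a made-up example value, and this only holds on CPython and only while the object is still alive.

```python
import ctypes


def ref(obj):
    return id(obj)


def deref(addr):
    return ctypes.cast(addr, ctypes.py_object).value


payload = {"rows": [1, 2, 3]}
addr = ref(payload)       # a plain int, cheap to pickle
restored = deref(addr)    # the same object again, not a copy
print(restored is payload)  # prints True
```

Dereferencing the address of an object that has already been freed is undefined behavior, so the caller has to keep the original alive.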

Now even though the callstack is cleared upon fork of the worker processes, you still have the parent objects available, and properly tracked and refcounted, as you can check from `gc.get_objects()`. This is in fact a feature of `gc` as explained in the doc (https://docs.python.org/3/library/gc.html):

> If a process will fork() without exec(), avoiding unnecessary copy-on-write in child processes will maximize memory sharing and reduce overall memory usage. This requires both avoiding creation of freed “holes” in memory pages in the parent process and ensuring that GC collections in child processes won’t touch the gc_refs counter of long-lived objects originating in the parent process. To accomplish both, call gc.disable() early in the parent process, gc.freeze() right before fork(), and gc.enable() early in child processes.
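
Applied to a pool, the recipe quoted from the gc documentation might look like the sketch below. It assumes the "fork" start method; `DATA`, `_init_worker`, `total` and `run` are made-up names, and restoring the parent's gc state at the end is my own addition, not part of the quoted recipe.

```python
import gc
import multiprocessing as mp

DATA = None  # hypothetical large structure inherited by the workers


def _init_worker():
    gc.enable()  # re-enable collection early in each child


def total(i):
    return sum(DATA[i])


def run():
    global DATA
    gc.disable()  # early in the parent: avoid creating freed "holes"
    DATA = [list(range(1000)) for _ in range(3)]
    gc.freeze()   # right before forking: move tracked objects out of gc's way
    ctx = mp.get_context("fork")  # assumption: POSIX with fork support
    try:
        with ctx.Pool(2, initializer=_init_worker) as pool:
            return pool.map(total, range(3))
    finally:
        # Not part of the quoted recipe: restore the parent's gc state.
        gc.unfreeze()
        gc.enable()


if __name__ == "__main__":
    print(run())
```

Note that this only keeps the garbage collector from touching inherited objects; reference count updates in the children can still trigger copy-on-write.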

Now whenever you want to send large objects to a `multiprocessing.Pool` or `concurrent.futures.ProcessPoolExecutor`, you can avoid expensive pickling by just sending these references.

    import multiprocessing as mp

    class BigObject:
        def compute_something(self):
            ...  # placeholder for the real computation

    def child(rbo):
        bo = deref(rbo)
        return bo.compute_something()

    def parent():
        bo1 = BigObject()
        bo2 = BigObject()
        # The address trick only works when workers are forked from this process.
        with mp.get_context("fork").Pool(2) as pool:
            result = pool.map(child, [ref(bo1), ref(bo2)])

In a real codebase though, there are some caveats around this. You cannot take the reference of just anything: there are temporaries, cached small integers, etc. You will need some form of higher-level wrapper around `ref()` to properly choose when and what to reference or to copy.
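
One possible shape for such a wrapper, as a rough sketch: small immutable values are copied (pickled normally) and everything else is sent by address. `pack`, `unpack` and the `_COPY_TYPES` list are made up for illustration, and the right policy will depend on your codebase.

```python
import ctypes

# Assumption: values of these types are cheap to pickle and may be cached or
# temporary, so they are copied instead of referenced.
_COPY_TYPES = (type(None), bool, int, float, complex, str, bytes)


def pack(obj):
    """Return a small picklable token: either the value itself or its address."""
    if isinstance(obj, _COPY_TYPES):
        return ("copy", obj)
    return ("ref", id(obj))


def unpack(packed):
    """Inverse of pack(); by-reference tokens are only valid in a forked child
    (or in the same process) while the original object is still alive."""
    kind, payload = packed
    if kind == "copy":
        return payload
    return ctypes.cast(payload, ctypes.py_object).value


big = list(range(1000))
args = [pack(42), pack(big)]  # what would be sent to the pool
```

The caller still has to keep every by-reference object alive until the workers are done with it, otherwise the address may be reused.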

Also, it may be inconvenient to have your child functions explicitly dereference their parameters, since that forces you to write _dereference wrappers_ around your original functions. A good strategy I've used is to create a proxy class that stores a reference and overrides `__getstate__`/`__setstate__` to pickle itself as a reference and unpickle itself as a proxy. That way, you can transparently pass these proxies to your original functions without any modification.
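
A stripped-down sketch of what such a proxy could look like, assuming CPython and the "fork" start method. `ForkProxy` is a hypothetical name, it only forwards public attribute access, and a real version would need more care (special methods, lifetime of the target, and so on).

```python
import ctypes
import multiprocessing as mp


class ForkProxy:
    """Hypothetical proxy: pickles as an address, forwards attribute access."""

    def __init__(self, target):
        self._target = target  # strong reference keeps the target alive in the parent

    def __getstate__(self):
        return id(self._target)  # pickle only the address

    def __setstate__(self, addr):
        # Runs in the forked worker, where the same address is still valid.
        self._target = ctypes.cast(addr, ctypes.py_object).value

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself. Private
        # and special names are not forwarded, which also avoids recursion
        # while the proxy is being unpickled.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._target, name)


class BigObject:
    def __init__(self, values):
        self.values = values

    def compute_something(self):
        return sum(self.values)


def child(bo):
    # The original function: no explicit dereferencing needed.
    return bo.compute_something()


def run():
    proxies = [ForkProxy(BigObject(range(10))), ForkProxy(BigObject(range(100)))]
    ctx = mp.get_context("fork")  # the targets must exist before the workers fork
    with ctx.Pool(2) as pool:
        return pool.map(child, proxies)


if __name__ == "__main__":
    print(run())
```

The proxies have to be created before the pool, since the workers can only see objects that existed at fork time.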

Oh, I see. You want to avoid serializing the objects since they will be copied anyway with fork(), but the parent needs a way to refer to a particular object when talking to the child, so it needs to pass some kind of ID.

You could also do it without pointers and ctypes by using e.g. an array index as the ID:

    inherited_objects = []

    def ref(obj):
        object_id = len(inherited_objects)
        inherited_objects.append(obj)
        return object_id

    def deref(object_id):
        return inherited_objects[object_id]

Although this part needs a small change as well, so that the object ID is assigned before forking:

    def parent():
        bo1 = BigObject()
        bo2 = BigObject()
        refs = list(map(ref, [bo1, bo2]))
        with mp.Pool(2) as pool:
            result = pool.map(child, refs)

> You want to avoid serializing the objects since they will be copied anyway with fork(), but the parent needs a way to refer to a particular object when talking to the child, so it needs to pass some kind of ID.

Yes, that is exactly and succinctly the crux of the idea :-)

As you found out, you can rely on indices or keys in a global object to achieve the same result. The annoying part though is that you need to pre-provision these objects before the pool, and clean them up afterwards to avoid keeping references to them. That means some explicit boilerplate every time you use a pool.
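
To make that boilerplate concrete, here is one way it could be packaged, as a sketch assuming the "fork" start method. The `inherited` context manager, `_registry` and the other names are made up for illustration.

```python
import contextlib
import itertools
import multiprocessing as mp

_registry = {}                 # global table inherited by forked workers
_next_key = itertools.count()  # keys are never reused


@contextlib.contextmanager
def inherited(*objects):
    """Register objects before the pool is created, unregister them afterwards."""
    keys = [next(_next_key) for _ in objects]
    _registry.update(zip(keys, objects))
    try:
        yield keys
    finally:
        for key in keys:
            _registry.pop(key, None)  # drop the extra references in the parent


def deref(key):
    return _registry[key]


def child(key):
    return len(deref(key))


def run():
    a, b = list(range(5)), list(range(7))
    with inherited(a, b) as keys:
        ctx = mp.get_context("fork")  # the pool must be created inside the block
        with ctx.Pool(2) as pool:
            result = pool.map(child, keys)
    return result, len(_registry)


if __name__ == "__main__":
    print(run())
```

This hides the registration and cleanup, but the caller still has to wrap every pool in the extra `with` block, which is the kind of intrusiveness I mean.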

The nice thing with the id() trick is that it's very unintrusive for the caller: the reference count stays the same in the parent process and is only increased in the child, unbeknownst to the parent.