Hacker News new | ask | show | jobs
by Galanwe 775 days ago
I don't actually, but it can be explained in a few lines of code. Consider the following two simple functions:

    def ref(obj):
        return id(obj)

    def deref(addr):
        import ctypes
        return ctypes.cast(addr, ctypes.py_object).value
Basically, this relies on an implementation detail of `id()` in CPython: the unique id of an object is its memory address. `ref()` returns a reference to an object (think `&` in C), and `deref()` dereferences it back (think `*` in C). This is close to the standard `weakref` module in essence, but weakref is a black box.

Now even though the callstack is cleared upon fork of the worker processes, you still have the parent objects available, and properly tracked and refcounted, as you can check from `gc.get_objects()`. This is in fact a feature of `gc` as explained in the doc (https://docs.python.org/3/library/gc.html):

> If a process will fork() without exec(), avoiding unnecessary copy-on-write in child processes will maximize memory sharing and reduce overall memory usage. This requires both avoiding creation of freed “holes” in memory pages in the parent process and ensuring that GC collections in child processes won’t touch the gc_refs counter of long-lived objects originating in the parent process. To accomplish both, call gc.disable() early in the parent process, gc.freeze() right before fork(), and gc.enable() early in child processes.

Now whenever you want to send large objects to a `multiprocessing.Pool` or `concurrent.futures.ProcessPoolExecutor`, you can avoid expensive pickling by just sending these references.

    class BigObject: pass

    def child(rbo):
        bo = deref(rbo)
        return bo.compute_something()

    def parent():
        bo1 = BigObject()
        bo2 = BigObject()
        with Pool(2) as pool:
            result = pool.map(child, [ref(bo1), ref(bo2)])
In a real codebase though, there are some caveats around this. You cannot take the reference of just anything, there are temporaries, cached small integers, etc. You will need some form of higher level wrapper around `ref()` to properly choose when and what to reference or to copy.

Also it may be inconvenient to have your child functions explicitely dereference their parameters, it will force you to write _dereference wrappers_ around your original functions. A good strategy I've used is to create a proxy class that stores a reference and override `__getstate__`/`__setstate__` for pickling itself as reference and unpickling itself as a proxy. That way, you can transparently pass these proxies to your original functions without any modification.

1 comments

Oh, I see. You want to avoid serializing the objects since they will be copied anyway with fork(), but the parent needs a way to refer to a particular object when talking to the child, so it needs to pass some kind of ID.

You could also do it without pointers and ctypes by using e.g. an array index as the ID:

    inherited_objects = []

    def ref(obj):
        object_id = len(inherited_objects)
        inherited_objects.append(obj)
        return object_id

    def deref(object_id):
        return inherited_objects[object_id]
Although this part needs a small change as well, so that the object ID is assigned before forking:

    def parent():
        bo1 = BigObject()
        bo2 = BigObject()
        refs = list(map(ref, [bo1, bo2]))
        with mp.Pool(2) as pool:
            result = pool.map(child, refs)
> You want to avoid serializing the objects since they will be copied anyway with fork(), but the parent needs a way to refer to a particular object when talking to the child, so it needs to pass some kind of ID.

Yes, that is exactly and succintely the crux of the idea :-)

As you found out, you can rely on indices or keys in a global object to achieve the same result. The annoying part though is that you need to pre-provision these objets before the pool, and clean them after to avoid keeping references to them. That means some explicit boilerplate every time you use a pool.

The nice thing with the id() trick is that it's very unintrusive for the caller, as the reference count stays the same in the parent process, it is only increased in the child, unbeknownst to the parent.