by albertzeyer 775 days ago
The problem is, once you access such shared objects in Python, it is never readonly access but actually read-write, because it modifies the refcount. The problem is also described here: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multip...

But also, you say you would prefer such an unbounded memory access hack over using a global variable?

But also, why does it need to be a global variable? When you fork(), afterwards all the local variables are available to the child process. No need for global variables.

> The problem is, once you access such shared objects in Python, it is never readonly access but actually read-write, because it modifies the refcount.

That is right, but it is a mere drop in the ocean. First, because reference counting is not intrusive in CPython (meaning the reference counting structures are outside the PyObject, last I checked), meaning you will mainly copy-on-write these small external structures anyway. Second, what I'm describing here is for when pickling objects across workers is prohibitively slow and memory-consuming; typically that means sharing pandas dataframes of dozens or hundreds of gigabytes. Some copied refcount pages here and there are really not going to be the culprit.

> But also, why does it need to be a global variable? When you fork(), afterwards all the local variables are available to the child process. No need for global variables.

Right, but you need some way to access these variables, and once you're in a worker process you are simply in a different scope.

    from multiprocessing import Pool

    def workerfunc(x):
        # I'm a poor worker in an empty scope:
        # parent()'s juicy_variable is not visible here
        ...

    def parent():
        juicy_variable = ...
        with Pool(42) as pool:
            result = pool.map(workerfunc, [1, 2, 3])
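
For comparison, here is a rough sketch of the global-variable route being discussed. It assumes the "fork" start method (POSIX only), and the dict is just a hypothetical stand-in for a large object such as a dataframe. The worker sees the global because the child inherits the parent's memory, so only the small arguments and results get pickled.

```python
import multiprocessing as mp

# Hypothetical stand-in for a large object (e.g. a big dataframe).
juicy_variable = None

def workerfunc(x):
    # With the "fork" start method the child inherits the parent's memory,
    # so the global assigned before the pool was created is visible here.
    return juicy_variable[x]

def parent():
    global juicy_variable
    juicy_variable = {1: "one", 2: "two", 3: "three"}
    ctx = mp.get_context("fork")  # assumption: a POSIX system with fork()
    with ctx.Pool(2) as pool:
        # Only the small arguments and results are pickled, not the global.
        return pool.map(workerfunc, [1, 2, 3])

if __name__ == "__main__":
    print(parent())
```

The global has to be assigned before the pool is created, since the workers are forked at that point.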

> reference counting is not intrusive in CPython (meaning the reference counting structures are outside the PyObject, last I checked)

That's wrong. That was never the case.

Recent CPython: https://github.com/python/cpython/blob/6d419db10c84cacbb3862...

CPython 2.0: https://github.com/python/cpython/blob/2a9b0a93091b9ef7350a9...

CPython 0.9.8: https://github.com/python/cpython/blob/dd104400dc551dd4098f3...
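
A quick way to see that the count is tied to the object itself, as a small sketch for default CPython builds only (`refcount_delta` is a made-up helper name): storing new references to an object changes what `sys.getrefcount` reports for it, which is why merely using a shared object after fork() can dirty the page it lives on.

```python
import sys

def refcount_delta():
    obj = object()
    before = sys.getrefcount(obj)
    holders = [obj, obj]  # the list stores two new references to obj
    after = sys.getrefcount(obj)
    return after - before

if __name__ == "__main__":
    print(refcount_delta())
```

The absolute values reported by `sys.getrefcount` vary between interpreter versions, so only the difference is meaningful here.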

Regarding multiprocessing.Pool, that would not work as I said. I was thinking more about a plain fork, like this:

    import os

    def parent():
        juicy_variable = ...

        def workerfunc(x):
            # I can access juicy_variable
            ...

        children = []
        for i in [1, 2, 3]:
            child = os.fork()
            if child == 0:
                # In the child: do the work, then exit immediately
                # without running the parent's cleanup handlers.
                workerfunc(i)
                os._exit(0)
            children.append(child)

        # Wait for and clean up the children.
        # Communicate somehow with the children to get back results.
        ...
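
To make the "communicate somehow" part concrete, here is one possible sketch using a pipe per child. `run_forked` is a hypothetical helper, not an existing API, and the dict again stands in for a large object. It assumes POSIX fork() and results small enough to fit in the pipe buffer, and it has no real error handling: if the worker raises, the parent will see an `EOFError` when reading.

```python
import os
import pickle

def run_forked(func, args):
    """Fork one child per argument and collect the results over pipes."""
    children = []
    for arg in args:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: compute, send the pickled result, then exit
            # without running the parent's cleanup handlers.
            os.close(read_fd)
            try:
                with os.fdopen(write_fd, "wb") as w:
                    pickle.dump(func(arg), w)
            finally:
                os._exit(0)
        # Parent: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd, "rb") as r:
            results.append(pickle.load(r))
        os.waitpid(pid, 0)  # reap the child
    return results

def parent():
    juicy_variable = {1: "one", 2: "two", 3: "three"}

    def workerfunc(x):
        # The closure is never pickled; the child inherits it through fork().
        return juicy_variable[x]

    return run_forked(workerfunc, [1, 2, 3])

if __name__ == "__main__":
    print(parent())
```

Only the results cross the process boundary here, so the local `juicy_variable` never needs to be a global or to be pickled.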

>> reference counting is not intrusive in CPython (meaning the reference counting structures are outside the PyObject, last I checked)

> That's wrong. That was never the case.

You are right, I was mistaken.

The point stands though: PyObjects are not really an issue for use cases where these tricks are needed.

> I was thinking more about a plain fork, like this

Right, well, you can recreate a multiprocessing pool of your own, with different pros and cons, sure. That's another approach, I guess.