

by ogrisel 3232 days ago

The pickling implementation of joblib has support for memory mapping numpy arrays nested in arbitrary data structures such as pandas dataframes.

Save the dataframe in a folder that can be accessed by the gunicorn workers:

    import joblib
    joblib.dump(df, '/folder/shared_data.pkl')

Then in the code run by the flask / gunicorn workers themselves:

    import joblib
    shared_df = joblib.load('/folder/shared_data.pkl', mmap_mode='r')
    # use shared_df as usual (in-place modifications are not
    # allowed: the memory-mapped arrays are read-only)

Some pandas functions can have issues with read-only buffers, though: https://github.com/pandas-dev/pandas/issues/17192 (caused by a currently unsolved bug / limitation of Cython), but it may still work for your use case.
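To make the read-only behaviour concrete, here is a minimal sketch of the dump / load round trip. It assumes a plain dict of numpy arrays as a stand-in for the dataframe, and the helper name `dump_then_mmap` and the temporary folder are hypothetical, chosen only for illustration:

```python
import os
import tempfile

import joblib
import numpy as np


def dump_then_mmap(data, folder):
    """Dump `data` once, then reload it with its arrays memory-mapped read-only."""
    path = os.path.join(folder, "shared_data.pkl")
    # No compression here: compressed files cannot be memory-mapped.
    joblib.dump(data, path)
    return joblib.load(path, mmap_mode="r")


# Hypothetical nested structure standing in for the dataframe.
data = {
    "prices": np.arange(1_000_000, dtype=np.float64),
    "meta": {"ids": np.arange(10, dtype=np.int64)},
}
shared = dump_then_mmap(data, tempfile.mkdtemp())

# The nested arrays come back backed by the file on disk, not copied into RAM.
print(type(shared["prices"]).__name__)
print(shared["prices"].flags.writeable)

# In-place modification of a read-only memory map is rejected.
try:
    shared["prices"][0] = 1.0
except ValueError as exc:
    print("write rejected:", exc)
```

Reads, slices and reductions go through as with a regular array; only code that tries to write into the buffer (or that insists on a writable buffer, as in the pandas issue above) runs into trouble.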


detroitcoder 3232 days ago

This looks very interesting. I am reading the docs https://pythonhosted.org/joblib/parallel.html#manual-managem... and it looks like it would help a lot (possibly solve the issue). Do you have any experience using this in production?

detroitcoder 3232 days ago

DAMN. I just did a basic test and it kinda just worked?!? I created a test dataframe of 100M rows x 10 cols, which took up ~2.3G, and then called joblib.dump within the on_starting hook, which runs when the gunicorn master starts up. Then I loaded that df with joblib.load within the worker and the total memory consumption was practically flat. Then I bumped the number of workers up to 20 and it was still flat. That is actually amazing. Coolest thing I have seen in months for how easy it is. Now I have to test whether the analytics actually work and do a deep dive into the mechanics of mem-mapping.
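For anyone wanting to reproduce this, a rough sketch of what such a gunicorn config file could look like. This is not my exact code: `SHARED_PATH`, `build_data` and `load_shared` are hypothetical names, and a small dict of numpy arrays stands in for the real dataframe. Only the `on_starting(server)` hook and the `workers` setting are gunicorn's own:

```python
# gunicorn.conf.py (hypothetical sketch)
import os
import tempfile

import joblib
import numpy as np

workers = 4  # gunicorn setting: number of worker processes

# Assumed location; any folder readable by the workers would do.
SHARED_PATH = os.path.join(tempfile.gettempdir(), "shared_data.pkl")


def build_data():
    """Hypothetical stand-in for building the large dataframe."""
    rng = np.random.default_rng(0)
    return {"values": rng.random((1000, 10))}


def on_starting(server):
    """gunicorn server hook: runs once in the master, before workers start."""
    joblib.dump(build_data(), SHARED_PATH)


def load_shared():
    """Called from worker code: maps the dumped arrays read-only."""
    return joblib.load(SHARED_PATH, mmap_mode="r")
```

The worker side then calls `load_shared()` once, for example when the flask app module is imported. Since every worker maps the same file read-only, the pages should be shared through the OS page cache rather than duplicated per process, which would explain the flat memory usage.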

ogrisel 3224 days ago

Thanks for your feedback. I am glad I could help you.