Hacker News
by detroitcoder 3234 days ago
^This. A very common use case in the applications I work with is to create a very large in-memory, read-only pandas dataframe, put a Flask interface in front of operations on that dataframe, serve it with gunicorn, and expose it as an API. If I use async workers, the dataframe operations are bound by the GIL. If I use sync workers, each process needs its own copy of the dataframe, which the server cannot handle (I have never seen pre-fork shared memory work for this problem). I don't want to introduce another technology to solve this problem.
6 comments

FWIW, I routinely throw many GBs of pickled dataframes into Redis and then split the workload across multiple processes, coordinated as a sort of namespaced job queue, all via Redis pubsub, blpop, l/rpush, and set/get. If you really need to squeeze out performance, there are serialization formats that are much faster and more efficient than pickle, such as msgpack or protocol buffers. You just have to chunk your bulk data into pieces and spread them across multiple workers. You have an orchestrator class that puts things onto the queues, pulls things off, loads any modules you need, handles exceptions, etc.

Then you can namespace your queues (and workers) and have separate queues for results handling, to push data to the next stage of the pipeline, etc., with stacks of workers configured as needed. It's all pretty high level from there. The GIL has no effect here, and as a side effect you can now use a massive number of parallel processes for heavy lifting and crunching, even on different machines over the network, whereas that wouldn't be possible with a traditional threaded architecture.

Not saying this necessarily covers your use case, but it seems strange to use dataframes as a sort of in-memory database, rather than as the framing for the munging and heavy lifting. What, are you wanting to put multiple cursors on it or something? You could do this with greenlets, for what it's worth. But as someone who has gone down that route (multiple greenlets working over a shared stack), I promise that doing it with multiple processes and a queue is better, and ultimately way more flexible. Especially if you use something like msgpack or protocol buffers: then you can have workers written in multiple programming languages and development paradigms doing different work at different stages, all orchestrated and working together via Redis.
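
A minimal sketch of the chunk-and-queue pattern described above, not the commenter's actual code. The queue names and the functions `chunked`, `enqueue`, and `work_once` are hypothetical. `client` is assumed to be a redis-py client (for example `redis.Redis()`), of which only `rpush` and `blpop` are used. Pickle is kept for brevity; unpickling data from an untrusted source is unsafe.

```python
import pickle

QUEUE = "jobs:crunch"       # hypothetical namespaced work queue
RESULTS = "results:crunch"  # hypothetical results queue


def chunked(rows, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def enqueue(client, rows, size):
    """Orchestrator side: pickle each chunk and push it onto the work queue."""
    count = 0
    for chunk in chunked(rows, size):
        client.rpush(QUEUE, pickle.dumps(chunk))
        count += 1
    return count


def work_once(client, func, timeout=1):
    """Worker side: pop one chunk, process it, push the pickled result.

    Returns False if nothing arrived within `timeout` seconds.
    """
    item = client.blpop(QUEUE, timeout=timeout)
    if item is None:
        return False
    _key, payload = item
    result = func(pickle.loads(payload))
    client.rpush(RESULTS, pickle.dumps(result))
    return True
```

Each worker process would loop on `work_once`, so adding capacity means starting more workers, on the same machine or on another one that can reach the Redis server.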

The pickling implementation of joblib has support for memory mapping numpy arrays nested in arbitrary data structures such as pandas dataframes.

Save the dataframe in a folder that can be accessed by the gunicorn worker:

    import joblib
    joblib.dump(df, '/folder/shared_data.pkl')
Then in the code run by the flask / gunicorn workers themselves:

    import joblib
    shared_df = joblib.load('/folder/shared_data.pkl', mmap_mode='r')
    # use the shared_df as usual (in-place modifications are not
    # allowed, because the data is mapped read-only)
Some pandas functions can have issues with read-only buffers though: https://github.com/pandas-dev/pandas/issues/17192 (caused by a currently unsolved bug / limitation of Cython), but it can work for your use case.
This looks very interesting. I am reading the docs https://pythonhosted.org/joblib/parallel.html#manual-managem... and it looks like it would help a lot (possibly solve the issue). Do you have any experience using this in production?
DAMN. I just did a basic test and it kinda just worked?!? I created a test dataframe of 100M rows x 10 cols, which took up ~2.3G, and then called joblib.dump within the on_starting hook, which runs when the gunicorn master starts up. Then I loaded that df with joblib.load within the worker, and the total memory consumption was practically flat. Then I bumped the number of workers up to 20 and it was still flat. That is actually amazing. Coolest thing I have seen in months for how easy it is. Now I have to test whether the analytics actually work and do a deep dive into the mechanics of mem-mapping.
Thanks for your feedback. I am glad I could help you.
> create a very large in memory read-only pd dataframe and then put a flask interface to operations on that dataframe using gunicorn and expose as an API. [...]

May I ask what you consider large memory: MByte, GByte, TByte? The simplest solution is to store it as a blob on an SSD and read it via simple file IO, or put it into a DB. But I assume this was too slow, so it would be interesting to go into more detail.

In the end you can do shared memory with multiprocessing in Python, which - I have to admit - requires some setup and bookkeeping work.
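
As a rough illustration of that setup and bookkeeping, here is a sketch using `multiprocessing.shared_memory` from the standard library (Python 3.8+, so newer than this thread). The function names are hypothetical, the data is a flat, non-empty list of floats rather than a dataframe, and the creator is responsible for unlinking the block.

```python
from multiprocessing import Pool, shared_memory

ITEM = 8  # bytes per float64


def publish(values):
    """Create a shared block holding `values` as float64; caller must unlink it."""
    shm = shared_memory.SharedMemory(create=True, size=len(values) * ITEM)
    with shm.buf.cast("d") as view:  # view the raw bytes as doubles, no copy
        for i, v in enumerate(values):
            view[i] = float(v)
    return shm


def slice_sum(name, start, stop):
    """Attach to the block by name and sum items [start, stop) without copying."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        with shm.buf.cast("d") as view:
            return sum(view[start:stop])
    finally:
        shm.close()  # detach only; the creator unlinks


def parallel_sum(values, n_workers=2):
    """Sum `values` in worker processes that all read one shared copy."""
    shm = publish(values)
    try:
        step = -(-len(values) // n_workers)  # ceiling division
        jobs = [(shm.name, i, min(i + step, len(values)))
                for i in range(0, len(values), step)]
        with Pool(n_workers) as pool:
            return sum(pool.starmap(slice_sum, jobs))
    finally:
        shm.close()
        shm.unlink()  # free the block once all workers are done
```

On platforms that start workers with spawn or forkserver, `parallel_sum` should be called from under an `if __name__ == "__main__":` guard.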

Let's say there are a couple of dataframes that need a matrix multiply and that take up about 10 GB on a 32 GB host. I want to parameterize these manipulations and expose them over HTTP. I can only afford to cache 3 sets of them, which means that I can serve only 3 concurrent requests. I would like to provide more concurrency than this without reading from disk or storing the data out of process in a separate service, which adds complexity.
Tried a memmap?
I still wish to find a good tutorial about memmap. The doc about it is very formal. Something with clear use cases, patterns, gotchas and best practice would probably make it more popular.
Check out the joblib.dump example mentioned above. It is pretty impressive so far.
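
For the memmap question, a small sketch with `numpy.memmap`, assuming NumPy is installed. The file name, shape, and data are made up for illustration. The file is written once and then mapped read-only, so every process that maps it shares the same pages through the OS page cache.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "matrix.dat")  # hypothetical location
shape = (1000, 8)

# Write once: mode="w+" creates the file and maps it read-write.
writer = np.memmap(path, dtype=np.float64, mode="w+", shape=shape)
writer[:] = np.arange(shape[0] * shape[1], dtype=np.float64).reshape(shape)
writer.flush()  # push pending changes to disk
del writer

# Read many: mode="r" maps the file read-only; pages are loaded on access.
reader = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
col_means = reader.mean(axis=0)
```

Two gotchas: the raw file stores no metadata, so the reader has to supply the same dtype and shape as the writer, and assigning into an array opened with `mode="r"` raises an error.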
I rarely hear people complain about genuine use-cases but this would seem to be one. However, aren't most/all of the dataframe operations done in C extensions in these cases?
While a lot of NumPy is C and Fortran, Pandas is mostly pure Python and some Cython. And mostly it does not release the GIL.

You often end up having to implement your own C extensions or use Numba for the core of your processing. Even with BLAS enabled, NumPy has almost zero intrinsic parallelism, np.dot() being the notable exception which releases the GIL and uses multicore by itself.

> Even with BLAS enabled, NumPy has almost zero intrinsic parallelism, np.dot() being the notable exception which releases the GIL and uses multicore by itself.

Is there any sort of list (comprehensive or otherwise) that denotes which NumPy functions are parallelism-friendly? I mean this whether it's in terms of releasing the GIL, in terms of SIMD support, or in terms of being multi-core.

Why are you asking this using a throwaway?

np.dot() is multicore. np.load() (and family) releases the GIL. SIMD mostly depends on the build system, so if you want it you might need to build NumPy from source.

https://stackoverflow.com/questions/24022723/where-can-i-fin...

Is there a way to disable this? In an HPC environment, I don't want routines going multi-core without my explicit permission, under any circumstances. I will already have manually set up the parallelization to be at the highest logical level. If using Python, that usually means I have planned out the number of processes to be equal to the number of cores. If each process then starts doing its own multicore calculation (badly load-balanced!) it overtaxes the node and slows everything down.

I really wish numpy/pandas/scipy wouldn't do this kind of uncontrollable parallelization.

Underlying implementations often have a way to disable parallelism, e.g., OMP_NUM_THREADS=1 or MKL_NUM_THREADS=1.
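
A sketch of setting those variables from Python rather than from the shell. `pin_blas_threads` is a hypothetical helper, and `OPENBLAS_NUM_THREADS` is added here on the assumption that the NumPy build may link against OpenBLAS. The variables are typically read when the numerical library is loaded, so this has to run before the first import of numpy, scipy, or pandas.

```python
import os

# Variables commonly honored by OpenMP, MKL, and OpenBLAS thread pools.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def pin_blas_threads(n=1):
    """Set the thread-count variables to `n` for this process and its children."""
    for var in THREAD_VARS:
        os.environ[var] = str(n)
    return {var: os.environ[var] for var in THREAD_VARS}


pinned = pin_blas_threads(1)
# import numpy as np  # import numerical libraries only after this point
```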
pd.HDFStore is a good option for storing large DataFrames, and it has some powerful querying capabilities.