by detroitcoder
3234 days ago
^This. A very common use case in the applications I work with is to create a very large, read-only, in-memory pandas DataFrame, put a Flask interface on top of operations on that DataFrame, and expose it as an API using gunicorn. If I use async workers, the DataFrame operations are bound by the GIL. If I use sync workers, each process needs its own copy of the DataFrame, which the server cannot handle (I have never seen pre-fork shared memory work for this problem). I don't want to introduce another technology to solve this problem.
Then you can namespace your queues (and workers), and have separate queues for results handling that push data to the next stage of the pipeline, and so on, with stacks of workers configured as needed. It's all pretty high level from there. The GIL has no effect here, and as a side effect you can now use a massive number of parallel processes for heavy lifting and crunching, even on different machines over the network, whereas that wouldn't be possible with a traditional threaded architecture.
Not saying this necessarily covers your use case, but it seems strange to use dataframes as a sort of in-memory database, vs. using dataframes as the framing to do the munging and heavy lifting. What, are you wanting to put multiple cursors on it or something? You could do this with greenlets, for what it's worth. But as someone who has gone down that route (multiple greenlets working over a shared stack), I promise doing it with multiple processes and a queue is better, and ultimately way more flexible, especially if you use something like msgpack or protocol buffers. Then you can have workers written in multiple programming languages and development paradigms doing different work at different stages, all orchestrated and working together via Redis.
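To make the queue layout concrete, here is a rough sketch of the idea, not a real deployment. Everything in it is an assumption for illustration: `FakeBroker` is an in-process stand-in for a Redis server (it only mimics push/pop on named lists), `json` stands in for msgpack or protocol buffers, and the `pipeline:*` queue names and the two toy stages are made up. In practice each `run_stage` call would be a separate worker process, possibly on another machine, talking to Redis.

```python
import json
from collections import defaultdict, deque


class FakeBroker:
    """In-process stand-in for a Redis server: named FIFO lists of bytes."""

    def __init__(self):
        self._lists = defaultdict(deque)

    def rpush(self, name, payload):
        # Append to the tail of the named list.
        self._lists[name].append(payload)

    def lpop(self, name):
        # Pop from the head of the named list, or return None if it is empty.
        queue = self._lists[name]
        return queue.popleft() if queue else None


def encode(message):
    # json stands in for msgpack / protocol buffers: a language-neutral format.
    return json.dumps(message).encode("utf-8")


def decode(payload):
    return json.loads(payload.decode("utf-8"))


def run_stage(broker, source, sink, func):
    """Drain `source`, apply `func` to each message, push results to `sink`."""
    handled = 0
    while (payload := broker.lpop(source)) is not None:
        broker.rpush(sink, encode(func(decode(payload))))
        handled += 1
    return handled


# Hypothetical namespaced queue names, one per pipeline stage.
RAW = "pipeline:raw"
SQUARED = "pipeline:squared"
RESULTS = "pipeline:results"

broker = FakeBroker()
for n in range(5):
    broker.rpush(RAW, encode({"value": n}))

# In a real setup each stage would run in its own worker process.
run_stage(broker, RAW, SQUARED, lambda m: {"value": m["value"] ** 2})
run_stage(broker, SQUARED, RESULTS, lambda m: {"value": m["value"] + 1})

results = []
while (payload := broker.lpop(RESULTS)) is not None:
    results.append(decode(payload)["value"])
print(results)  # [1, 2, 5, 10, 17]
```

Because every message crosses the queue as plain bytes, a stage only needs to agree on the queue name and the wire format, which is what lets workers in other languages join in.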