| HN Mirror

by tinix 3233 days ago

FWIW, I routinely throw many GBs of pickled dataframes into Redis, then spread the workload across multiple processes coordinated as a sort of namespaced job queue, all via Redis pubsub, blpop, l/rpush, and set/get. If you really need to squeeze out performance, there are much faster and more efficient serialization formats than pickle, like msgpack or protocol buffers. You just have to chunk your bulk data into pieces and spread them across multiple workers. You have an orchestrator class that puts things onto the queues, pulls things off, loads any modules you need, handles exceptions, etc...
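
For what it's worth, the chunk-and-queue part might look roughly like the sketch below. It is only an illustration, not my actual orchestrator: the function names and the "<namespace>:<stage>" key scheme are made up, and `client` is assumed to be a redis-py `redis.Redis` instance created without `decode_responses=True`, so payloads come back as bytes. Plain lists of rows stand in for dataframe slices.

```python
import pickle


def queue_key(namespace, stage):
    # Hypothetical naming scheme: "<namespace>:<stage>", e.g. "etl:jobs".
    return f"{namespace}:{stage}"


def chunked(rows, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def enqueue_chunks(client, namespace, rows, size):
    """Pickle each chunk and RPUSH it onto the namespaced job queue."""
    key = queue_key(namespace, "jobs")
    count = 0
    for chunk in chunked(rows, size):
        client.rpush(key, pickle.dumps(chunk))
        count += 1
    return count


def work_once(client, namespace, func, timeout=1):
    """BLPOP one job, apply `func`, and RPUSH the pickled result.

    Returns False if no job arrived within `timeout` seconds.
    Only unpickle data from queues you control: pickle can run
    arbitrary code when loading untrusted input.
    """
    item = client.blpop([queue_key(namespace, "jobs")], timeout=timeout)
    if item is None:
        return False
    _key, payload = item
    result = func(pickle.loads(payload))
    client.rpush(queue_key(namespace, "results"), pickle.dumps(result))
    return True
```

Each worker process would just call `work_once` in a loop with its own connection, and the producer calls `enqueue_chunks` once per batch.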

Then you can namespace your queues (and workers), and have separate queues for results handling to push data to the next stage of the pipeline, etc., with stacks of workers configured as needed. It's all pretty high level from there. The GIL has no effect here because each worker is its own process, and as a side effect you can now use a massive number of parallel processes for heavy lifting and crunching, even on different machines over the network, whereas that wouldn't be possible with a traditional threaded architecture.
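
A rough sketch of one pipeline stage, under stated assumptions: `run_stage`, the stage names, and the `errors` queue are hypothetical, and `client` is assumed to be a redis-py `redis.Redis` instance returning bytes. The worker drains one namespaced queue, pushes results onto the next stage's queue, and parks failures on an error queue instead of dying.

```python
import pickle
import traceback


def run_stage(client, namespace, source, sink, func, timeout=1):
    """Drain "<namespace>:<source>" and feed "<namespace>:<sink>".

    Jobs whose handler raises are reported on "<namespace>:errors"
    (a hypothetical queue name) so one bad chunk does not stop the
    worker. Returns the number of jobs processed successfully once
    the source queue has stayed empty for `timeout` seconds.
    """
    done = 0
    while True:
        item = client.blpop([f"{namespace}:{source}"], timeout=timeout)
        if item is None:
            return done
        payload = item[1]
        try:
            result = func(pickle.loads(payload))
        except Exception:
            report = {"stage": source, "error": traceback.format_exc()}
            client.rpush(f"{namespace}:errors", pickle.dumps(report))
            continue
        client.rpush(f"{namespace}:{sink}", pickle.dumps(result))
        done += 1
```

Run one of these per process (for example via `multiprocessing.Process`, or just separate scripts on separate machines), each with its own connection. Chaining stages is then only a matter of making one stage's `sink` the next stage's `source`.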

Not saying this necessarily covers your use case, but it seems strange to use dataframes as a sort of in-memory database, vs. using dataframes as the framing to do the munging and heavy lifting. What, are you wanting to put multiple cursors on it or something? You could do this with greenlets, for what it's worth... But as someone who has gone down that route (multiple greenlets working over a shared stack), I promise doing it with multiple processes and a queue is better, and ultimately way more flexible. Especially if you use something like msgpack or protocol buffers... Then you can have workers written in multiple programming languages and development paradigms doing different work at different stages, all orchestrated and working together via Redis.
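
To make the cross-language point concrete, here is a minimal sketch of a language-neutral payload. The `encode`/`decode` names and the columns-plus-rows layout are assumptions for illustration only. It uses the `msgpack` package if it is installed and otherwise falls back to JSON from the standard library, which is a stand-in, not the same format.

```python
try:
    import msgpack  # third-party package, optional here

    def encode(obj):
        """Serialize plain lists/dicts/numbers/strings to bytes."""
        return msgpack.packb(obj, use_bin_type=True)

    def decode(data):
        """Inverse of encode()."""
        return msgpack.unpackb(data, raw=False)

except ImportError:
    import json  # fallback so the sketch still runs without msgpack

    def encode(obj):
        """Serialize plain lists/dicts/numbers/strings to bytes."""
        return json.dumps(obj).encode("utf-8")

    def decode(data):
        """Inverse of encode()."""
        return json.loads(data.decode("utf-8"))


def frame_payload(columns, rows):
    # Hypothetical wire layout: column names plus row lists, which any
    # language with a msgpack (or JSON) library can read back.
    return encode({"columns": list(columns), "rows": [list(r) for r in rows]})
```

The bytes from `frame_payload` go onto a queue the same way a pickle would, except the worker pulling them off doesn't have to be Python.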