Hacker News
by anko 1414 days ago
I've solved a similar problem in a similar way, and I've found Polars <https://www.pola.rs/> handles this quite well without needing ClickHouse. It has a Python library but does most of the processing in Rust, across multiple cores. I've used it on datasets up to about 20 GB with no worries; beyond that, my computer's RAM became the issue, not Polars itself.
1 comments

We were using 500+ GB of memory at peak and expected that to grow. If I remember correctly, we didn't go with Polars because we needed to run custom apply functions on DataFrames. Polars had them, but the function took a tuple (not a DataFrame or dict), which makes for really error-prone code when you've got 20+ columns. Dask and Spark both supported a batch transform operation, so the function took a pandas DataFrame as input and returned one as output.
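
To make the tuple complaint concrete, here's a rough sketch in plain Python. It is not Polars' actual API, and the column names and values are made up for illustration. It just shows how a function that reads a row by position silently returns the wrong answer once the column order changes, while one that reads by name does not.

```python
# Hypothetical columns and values, for illustration only.
COLUMNS = ["user_id", "price", "quantity", "discount"]


def total_from_tuple(row):
    # Positional access: only correct while the column order never changes.
    return row[1] * row[2] * (1 - row[3])


def total_from_dict(row):
    # Named access: does not depend on column order.
    return row["price"] * row["quantity"] * (1 - row["discount"])


row_tuple = (42, 10.0, 3, 0.5)
row_dict = dict(zip(COLUMNS, row_tuple))

# The same record after an upstream change swaps two columns.
reordered_columns = ["user_id", "price", "discount", "quantity"]
reordered_tuple = (42, 10.0, 0.5, 3)
reordered_dict = dict(zip(reordered_columns, reordered_tuple))

print(total_from_tuple(row_tuple), total_from_dict(row_dict))
print(total_from_tuple(reordered_tuple), total_from_dict(reordered_dict))
```

With four columns the mistake is easy to spot; with 20+ it's the kind of bug that gets through review, which is why a function that takes and returns a whole DataFrame was easier for us to live with.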