It does. Though the functionality is quite new, we will extend this.
Calling `collect(streaming=True)` on a `LazyFrame` will allow you to process datasets that don't fit into memory. This currently works for groupbys, joins, many functions, filter etc.
We will extend this to sorts and likely other operations as well.
I'm curious if you could use this not for data science tasks but for data engineering tasks - say read a csv or pull a table from oracle and store it as delta lake table or something.
I know its a boring use case, but the challenge with it is that it is a complete waste of money and carbon footprint to use Spark to process a 20 MB CSV or table with few thousand records, but tools like Pandas fall apart when you hit a 50 GB CSV or table with few billion records.
Something more efficient (say, in Rust and not Python or Java) and yet scalable (due to not fitting everything into memory) would be a great help here.
Yes, say you’ve got imaging data in 3 (or higher) dimensions as numpy arrays and want to run some sort of algorithm on multiple cores/machines. Could be both for data analytics and simulations.
dask.bag has generic parallel processing capabilities. Query a database, a rest api, something. Then merge into dataframes across dask workers.
Calling `collect(streaming=True)` on a `LazyFrame` will allow you to process datasets that don't fit into memory. This currently works for groupbys, joins, many functions, filter etc.
We will extend this to sorts and likely other operations as well.