| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by pletnes 1260 days ago
	I actually thought polars’ lazy api would allow for out-of-ram computation? Also dask is more flexible than spark, since it lets you deal with numpy arrays and arbitrary objects better than spark can.

2 comments

ritchie46 1260 days ago

It does. Though the functionality is quite new, we will extend this.

Calling `collect(streaming=True)` on a `LazyFrame` will allow you to process datasets that don't fit into memory. This currently works for groupbys, joins, many functions, filter etc.

We will extend this to sorts and likely other operations as well.

link

glogla 1260 days ago

I'm curious if you could use this not for data science tasks but for data engineering tasks - say read a csv or pull a table from oracle and store it as delta lake table or something.

I know its a boring use case, but the challenge with it is that it is a complete waste of money and carbon footprint to use Spark to process a 20 MB CSV or table with few thousand records, but tools like Pandas fall apart when you hit a 50 GB CSV or table with few billion records.

Something more efficient (say, in Rust and not Python or Java) and yet scalable (due to not fitting everything into memory) would be a great help here.

link

ritchie46 1260 days ago

This is exactly what we are aiming for. There are already a lot of queries that can be processed with 100s GBS of data on my 16GB laptop.

And we will extend functionality for out of core processing. A single node can do a lot!

link

mooncitizen 1260 days ago

Can you please put an example about dask being more flexible than spark?

link

IanOzsvald 1260 days ago

This https://docs.dask.org/en/stable/spark.html notes "However, Dask is able to easily represent far more complex algorithms and expose the creation of these algorithms to normal users [compared to spark]" linking to: http://matthewrocklin.com/blog/work/2015/06/26/Complex-Graph...

link

pletnes 1260 days ago

Yes, say you’ve got imaging data in 3 (or higher) dimensions as numpy arrays and want to run some sort of algorithm on multiple cores/machines. Could be both for data analytics and simulations.

dask.bag has generic parallel processing capabilities. Query a database, a rest api, something. Then merge into dataframes across dask workers.

link