

by glogla 1260 days ago

I'm curious whether you could use this not for data science tasks but for data engineering tasks: say, read a CSV or pull a table from Oracle and store it as a Delta Lake table or something.

I know it's a boring use case, but the challenge is this: using Spark to process a 20 MB CSV or a table with a few thousand records is a complete waste of money and carbon, while tools like Pandas fall apart when you hit a 50 GB CSV or a table with a few billion records.

Something more efficient (say, written in Rust rather than Python or Java) and yet scalable (because it doesn't need to fit everything into memory) would be a great help here.

1 comment

ritchie46 1260 days ago

This is exactly what we are aiming for. Many queries can already process hundreds of GB of data on my 16 GB laptop.

And we will extend the functionality for out-of-core processing. A single node can do a lot!
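
For anyone unfamiliar with the term, here is a rough sketch of the idea behind out-of-core processing. It is not the library's implementation, just an illustration using only Python's standard library. The function name `streaming_sum_by_key`, the column names, and the sample data are all made up for the example. The point is that rows are read and aggregated one at a time, so memory use grows with the number of distinct keys rather than with the size of the file.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def streaming_sum_by_key(handle: TextIO, key_col: str, value_col: str) -> Dict[str, float]:
    """Sum value_col per key_col, reading the CSV one row at a time.

    Hypothetical helper for illustration: only the running totals are
    kept in memory, never the full table.
    """
    totals: Dict[str, float] = defaultdict(float)
    for row in csv.DictReader(handle):
        totals[row[key_col]] += float(row[value_col])
    return dict(totals)


# A tiny in-memory stand-in for a large file on disk (made-up data).
sample = io.StringIO("region,amount\neu,1.5\nus,2.0\neu,3.0\n")
result = streaming_sum_by_key(sample, "region", "amount")
print(result)
```

With a real file you would pass `open(path, newline="")` instead of the `StringIO` object. A dedicated engine does the same kind of thing in batches and in parallel, which is presumably where the speed comes from.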
