by ritchie46
1260 days ago
It does, though the functionality is quite new. Calling `collect(streaming=True)` on a `LazyFrame` lets you process datasets that don't fit into memory. This currently works for groupbys, joins, filters, and many other functions. We will extend this to sorts and likely other operations as well.
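
To illustrate the general idea only, here is a rough plain-Python sketch of what streaming a group-by means: rows are consumed one at a time, and only the per-key running totals stay in memory. This is a conceptual sketch, not how Polars implements it; the function name `streaming_group_sum` and the column names are made up for the example.

```python
import csv
import io
from collections import defaultdict


def streaming_group_sum(lines, key_col, value_col):
    """Sum value_col per key_col while reading rows one at a time.

    Hypothetical helper for illustration; memory use grows with the
    number of distinct keys, not with the number of rows.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row[key_col]] += float(row[value_col])
    return dict(totals)


# Tiny in-memory stand-in for a large file; an open file handle
# can be passed in exactly the same way.
sample = io.StringIO("category,amount\na,1.5\nb,2.0\na,3.0\n")
result = streaming_group_sum(sample, "category", "amount")
```

The same pattern is why aggregations and filters are natural first candidates for streaming, while a full sort is harder: a sort needs to see all rows before it can emit the first one.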
I know it's a boring use case, but the challenge is this: it is a complete waste of money and carbon to use Spark to process a 20 MB CSV or a table with a few thousand records, while tools like Pandas fall apart when you hit a 50 GB CSV or a table with a few billion records.
Something more efficient (say, written in Rust rather than Python or Java) and yet scalable (because it doesn't need to fit everything into memory) would be a great help here.