Hacker News new | ask | show | jobs
by LunaSea 960 days ago
Are you not getting OutOfMemory erros?

We are in the same scenario (querying lots of Parquet files in S3) and we noticed that DuckDB quickly crashes with OOMs in environments with a few gigs of RAM.

Setting the memory limit setting or the disk swap file has not worked.

3 comments

Not since 0.9.0. There’s been a lot of work on out of core stuff especially with big joins. It’s in the release notes. Also if you set a temp folder it will spill to disk (not set by default).

I used to have to chunk my data to avoid OOMs but I haven’t had to do that.

That said there are a few more out of core strategies on the roadmap that have not yet been implemented. If you still get OOMs, chunking your data will help.

Also consider that few gigs of RAM might not be large enough for your workload. Out of core strategies can only do so much.

https://duckdb.org/2023/09/26/announcing-duckdb-090.html

> If you still get OOMs, chunking your data will help.

what does chunking data mean here?..

Breaking it up. Instead of running a query for the entire year, run it month by month and stitch the final results back together.

Or if you have a unique string ID, calculate an integer hash using hash(ID) % 50 to get 50 chunks which you can process separately without OOMing.

A basic assumption is that all the chunks are independent of each other. Chunking is essentially temporary partitioning to fit your processing limitations.

I had the same idea in my mind, but I am struggle to understand efficient implementation.

Say, I have large table A, which I want to group by. I can chunk it into small tables, for example:

for i in range(64): execute('create table_%i as select * from A where hash(id) % 64 = %i')

but this would mean scanning source table 64 times which is likely not efficient, I am wondering if duckdb has some functionality which can do more efficient chunking?

It's not doing a full table scan of the entire source table 64 times -- it's only scanning the id column (DuckDB + Parquet is a columnar store) and retrieving the corresponding rows from other columns. It's pretty fast.
DuckDB 0.9 has fixed most of OOM scenarios by both using far less RAM and by spilling to disk better.

(Head of Produck at MotherDuck and a huge fan of DuckDB)

I encountered the same issue. Polars' memory usage was much lower for the datasets I tried.
Problems I encountered with polar was that the immutability meant that copies of the dataframe were generated and not cleaned up fast enough.