| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by LunaSea 960 days ago

Are you not getting OutOfMemory erros?

We are in the same scenario (querying lots of Parquet files in S3) and we noticed that DuckDB quickly crashes with OOMs in environments with a few gigs of RAM.

Setting the memory limit setting or the disk swap file has not worked.

3 comments

wenc 960 days ago

Not since 0.9.0. There’s been a lot of work on out of core stuff especially with big joins. It’s in the release notes. Also if you set a temp folder it will spill to disk (not set by default).

I used to have to chunk my data to avoid OOMs but I haven’t had to do that.

That said there are a few more out of core strategies on the roadmap that have not yet been implemented. If you still get OOMs, chunking your data will help.

Also consider that few gigs of RAM might not be large enough for your workload. Out of core strategies can only do so much.

https://duckdb.org/2023/09/26/announcing-duckdb-090.html

link

riku_iki 960 days ago

> If you still get OOMs, chunking your data will help.

what does chunking data mean here?..

link

wenc 960 days ago

Breaking it up. Instead of running a query for the entire year, run it month by month and stitch the final results back together.

Or if you have a unique string ID, calculate an integer hash using hash(ID) % 50 to get 50 chunks which you can process separately without OOMing.

A basic assumption is that all the chunks are independent of each other. Chunking is essentially temporary partitioning to fit your processing limitations.

link

riku_iki 960 days ago

I had the same idea in my mind, but I am struggle to understand efficient implementation.

Say, I have large table A, which I want to group by. I can chunk it into small tables, for example:

for i in range(64): execute('create table_%i as select * from A where hash(id) % 64 = %i')

but this would mean scanning source table 64 times which is likely not efficient, I am wondering if duckdb has some functionality which can do more efficient chunking?

link

wenc 959 days ago

It's not doing a full table scan of the entire source table 64 times -- it's only scanning the id column (DuckDB + Parquet is a columnar store) and retrieving the corresponding rows from other columns. It's pretty fast.

link

vgt 960 days ago

DuckDB 0.9 has fixed most of OOM scenarios by both using far less RAM and by spilling to disk better.

(Head of Produck at MotherDuck and a huge fan of DuckDB)

link

ayhanfuat 960 days ago

I encountered the same issue. Polars' memory usage was much lower for the datasets I tried.

link

LunaSea 957 days ago

Problems I encountered with polar was that the immutability meant that copies of the dataframe were generated and not cleaned up fast enough.

link