Breaking it up. Instead of running a query for the entire year, run it month by month and stitch the final results back together.
Or if you have a unique string ID, calculate an integer hash using hash(ID) % 50 to get 50 chunks which you can process separately without OOMing.
A basic assumption is that all the chunks are independent of each other. Chunking is essentially temporary partitioning to fit your processing limitations.
I had the same idea in my mind, but I am struggle to understand efficient implementation.
Say, I have large table A, which I want to group by.
I can chunk it into small tables, for example:
for i in range(64): execute('create table_%i as select * from A where hash(id) % 64 = %i')
but this would mean scanning source table 64 times which is likely not efficient, I am wondering if duckdb has some functionality which can do more efficient chunking?
It's not doing a full table scan of the entire source table 64 times -- it's only scanning the id column (DuckDB + Parquet is a columnar store) and retrieving the corresponding rows from other columns. It's pretty fast.
Or if you have a unique string ID, calculate an integer hash using hash(ID) % 50 to get 50 chunks which you can process separately without OOMing.
A basic assumption is that all the chunks are independent of each other. Chunking is essentially temporary partitioning to fit your processing limitations.