Hacker News new | ask | show | jobs
by wenc 960 days ago
Fantastic job by the DuckDB team. I’ve been using it for the past year to query 100s of GBs of Parquet files with complex analytic queries involving multiple levels of aggregations, joins and window functions and it all works and works fast. And I do all this from Jupyter Notebook.

It’s actually faster than AWS Athena for me.

5 comments

Yeah I've started using it with my search engine as well. It's fantastic how versatile it is for data all manner of data munging.

Just the other day I used it to transform an unordered 60 GB CSV file with links and texts into a 3 GB parquet file that's so fast I can create a projection for the relevant data of each partition in like a minute (which then fits in memory).

It has some minor stability issues so I'm not sure I'd build a full blown application on top of it, but for data transformation tasks it's amazing.

That’s great! Where are the Parquet files stored, and where is running DuckDB?
S3 and SageMaker. Same region.
Then you have low latency between the SageMaker notebook running DuckDB and the S3 buckets storing the Parquet files: neat!
Exactly the same experience here. I am thinking about getting rid off everything else in our data infra as soon as it hits 1.0.
this part is confusing to me in the doc.. I assume that you're using the httpfs (S3) extensions and perhaps doing scanning of the parquet files (which I think is actually streamed.. e.g. querying for a specific column values in a series of parquet files). We have a huge data set of hive-partitioned parquet files in s3 (e.g. /customerid/year/month/<series of parquet files>). Can i just scan these files using the glob pattern to retrieve data like I can with Athena? The extension doc seems to indicate that I can (from the doc: SELECT * FROM read_parquet('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1) where year=2013;) Or do I need to know which parquet files I'm looking for in S3 and bring them down to work on locally? If it's the former, then it seems equivalent to Athena..
No you can definitely use globs in DuckDB.

And no you don’t have to know the exact parquet file. You would treat the Hive partitioned data as a single dataset and DuckDB will scan it automatically. (Partition elimination, predicate pushdown etc all done automatically)

https://duckdb.org/docs/data/partitioning/hive_partitioning

ok.. thanks.. I'll try it out. I can think of few use-case that we have where this might be a good alternative to athena.
Are you not getting OutOfMemory erros?

We are in the same scenario (querying lots of Parquet files in S3) and we noticed that DuckDB quickly crashes with OOMs in environments with a few gigs of RAM.

Setting the memory limit setting or the disk swap file has not worked.

Not since 0.9.0. There’s been a lot of work on out of core stuff especially with big joins. It’s in the release notes. Also if you set a temp folder it will spill to disk (not set by default).

I used to have to chunk my data to avoid OOMs but I haven’t had to do that.

That said there are a few more out of core strategies on the roadmap that have not yet been implemented. If you still get OOMs, chunking your data will help.

Also consider that few gigs of RAM might not be large enough for your workload. Out of core strategies can only do so much.

https://duckdb.org/2023/09/26/announcing-duckdb-090.html

> If you still get OOMs, chunking your data will help.

what does chunking data mean here?..

Breaking it up. Instead of running a query for the entire year, run it month by month and stitch the final results back together.

Or if you have a unique string ID, calculate an integer hash using hash(ID) % 50 to get 50 chunks which you can process separately without OOMing.

A basic assumption is that all the chunks are independent of each other. Chunking is essentially temporary partitioning to fit your processing limitations.

I had the same idea in my mind, but I am struggle to understand efficient implementation.

Say, I have large table A, which I want to group by. I can chunk it into small tables, for example:

for i in range(64): execute('create table_%i as select * from A where hash(id) % 64 = %i')

but this would mean scanning source table 64 times which is likely not efficient, I am wondering if duckdb has some functionality which can do more efficient chunking?

It's not doing a full table scan of the entire source table 64 times -- it's only scanning the id column (DuckDB + Parquet is a columnar store) and retrieving the corresponding rows from other columns. It's pretty fast.
DuckDB 0.9 has fixed most of OOM scenarios by both using far less RAM and by spilling to disk better.

(Head of Produck at MotherDuck and a huge fan of DuckDB)

I encountered the same issue. Polars' memory usage was much lower for the datasets I tried.
Problems I encountered with polar was that the immutability meant that copies of the dataframe were generated and not cleaned up fast enough.