| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by wenc 960 days ago
	Fantastic job by the DuckDB team. I’ve been using it for the past year to query 100s of GBs of Parquet files with complex analytic queries involving multiple levels of aggregations, joins and window functions and it all works and works fast. And I do all this from Jupyter Notebook. It’s actually faster than AWS Athena for me.

5 comments

marginalia_nu 960 days ago

Yeah I've started using it with my search engine as well. It's fantastic how versatile it is for data all manner of data munging.

Just the other day I used it to transform an unordered 60 GB CSV file with links and texts into a 3 GB parquet file that's so fast I can create a projection for the relevant data of each partition in like a minute (which then fits in memory).

It has some minor stability issues so I'm not sure I'd build a full blown application on top of it, but for data transformation tasks it's amazing.

link

ngrilly 960 days ago

That’s great! Where are the Parquet files stored, and where is running DuckDB?

link

wenc 960 days ago

S3 and SageMaker. Same region.

link

ngrilly 960 days ago

Then you have low latency between the SageMaker notebook running DuckDB and the S3 buckets storing the Parquet files: neat!

link

datadeft 960 days ago

Exactly the same experience here. I am thinking about getting rid off everything else in our data infra as soon as it hits 1.0.

link

cmollis 960 days ago

this part is confusing to me in the doc.. I assume that you're using the httpfs (S3) extensions and perhaps doing scanning of the parquet files (which I think is actually streamed.. e.g. querying for a specific column values in a series of parquet files). We have a huge data set of hive-partitioned parquet files in s3 (e.g. /customerid/year/month/<series of parquet files>). Can i just scan these files using the glob pattern to retrieve data like I can with Athena? The extension doc seems to indicate that I can (from the doc: SELECT * FROM read_parquet('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1) where year=2013;) Or do I need to know which parquet files I'm looking for in S3 and bring them down to work on locally? If it's the former, then it seems equivalent to Athena..

link

wenc 960 days ago

No you can definitely use globs in DuckDB.

And no you don’t have to know the exact parquet file. You would treat the Hive partitioned data as a single dataset and DuckDB will scan it automatically. (Partition elimination, predicate pushdown etc all done automatically)

https://duckdb.org/docs/data/partitioning/hive_partitioning

link

cmollis 960 days ago

ok.. thanks.. I'll try it out. I can think of few use-case that we have where this might be a good alternative to athena.

link

LunaSea 960 days ago

Are you not getting OutOfMemory erros?

We are in the same scenario (querying lots of Parquet files in S3) and we noticed that DuckDB quickly crashes with OOMs in environments with a few gigs of RAM.

Setting the memory limit setting or the disk swap file has not worked.

link

wenc 960 days ago

Not since 0.9.0. There’s been a lot of work on out of core stuff especially with big joins. It’s in the release notes. Also if you set a temp folder it will spill to disk (not set by default).

I used to have to chunk my data to avoid OOMs but I haven’t had to do that.

That said there are a few more out of core strategies on the roadmap that have not yet been implemented. If you still get OOMs, chunking your data will help.

Also consider that few gigs of RAM might not be large enough for your workload. Out of core strategies can only do so much.

https://duckdb.org/2023/09/26/announcing-duckdb-090.html

link

riku_iki 960 days ago

> If you still get OOMs, chunking your data will help.

what does chunking data mean here?..

link

wenc 960 days ago

Breaking it up. Instead of running a query for the entire year, run it month by month and stitch the final results back together.

Or if you have a unique string ID, calculate an integer hash using hash(ID) % 50 to get 50 chunks which you can process separately without OOMing.

A basic assumption is that all the chunks are independent of each other. Chunking is essentially temporary partitioning to fit your processing limitations.

link

riku_iki 959 days ago

I had the same idea in my mind, but I am struggle to understand efficient implementation.

Say, I have large table A, which I want to group by. I can chunk it into small tables, for example:

for i in range(64): execute('create table_%i as select * from A where hash(id) % 64 = %i')

but this would mean scanning source table 64 times which is likely not efficient, I am wondering if duckdb has some functionality which can do more efficient chunking?

link

wenc 959 days ago

It's not doing a full table scan of the entire source table 64 times -- it's only scanning the id column (DuckDB + Parquet is a columnar store) and retrieving the corresponding rows from other columns. It's pretty fast.

link

vgt 960 days ago

DuckDB 0.9 has fixed most of OOM scenarios by both using far less RAM and by spilling to disk better.

(Head of Produck at MotherDuck and a huge fan of DuckDB)

link

ayhanfuat 960 days ago

I encountered the same issue. Polars' memory usage was much lower for the datasets I tried.

link

LunaSea 957 days ago

Problems I encountered with polar was that the immutability meant that copies of the dataframe were generated and not cleaned up fast enough.

link