Hacker News new | ask | show | jobs
by hendiatris 476 days ago
This is a huge challenge with Iceberg. I have found that there is substantial bang for your buck in tuning how parquet files are written, particularly in terms of row group size and column-level bloom filters. In addition to that, I make heavy use of the encoding options (dictionary/RLE) while denormalizing data into as few files as possible. This has allowed me to rely on DuckDB for querying terabytes of data at low cost and acceptable performance.

What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan that is developed from Iceberg metadata, but didn’t find anything. It would go a long way to showing where the bottleneck is for queries.

3 comments

Have you written about your parquet strategy anywhere? Or have suggested reading related to the tuning you've done? Super interested.
Also very interested in the parquet tuning. I have been building my data lake and most optimization I do is just with efficient partitioning.
I will write something up when the dust settles, I’m still testing things out. It’s a project where the data is fairly standardized but there is about a petabyte to deal with, so I think it makes sense to make investments in efficiency at the lower level rather than through tons of resources at it. That has meant a custom parser for the input data written in Rust, lots of analysis of the statistics of the data, etc. It has been a different approach to data engineering and one that I hope we see more of.

Regarding reading materials, I found this DuckDB post to be especially helpful in realizing how parquet could be better leveraged for efficiency: https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...

What query engine are you using?

Tends to be that an optimal file size for Parquet is about 1GiB, once again, the "many small files" problem of Hadoop remains.

Then it's things like, can you organise your data in such a way to take advantage of RLE etc.?

Either Spark or Redshift (serverless)
Parquet tuning has always been like that, ever since it first came out in 2013.

I worry with Iceberg that people think it's just a case of "use an Iceberg table in Snowflake" and boom, amazingly fast querying of data in S3!

how nested is the data in the parquet files?