by jasonjmcghee 475 days ago
Have you written about your parquet strategy anywhere? Or have suggested reading related to the tuning you've done? Super interested.

Also very interested in the parquet tuning. I have been building my data lake, and most of the optimization I do is just efficient partitioning.
I will write something up when the dust settles; I’m still testing things out. It’s a project where the data is fairly standardized but there is about a petabyte to deal with, so I think it makes sense to invest in efficiency at the lower level rather than throw tons of resources at it. That has meant a custom parser for the input data written in Rust, lots of analysis of the statistics of the data, etc. It has been a different approach to data engineering and one that I hope we see more of.

Regarding reading materials, I found this DuckDB post to be especially helpful in realizing how parquet could be better leveraged for efficiency: https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...

What query engine are you using?

The optimal file size for Parquet tends to be about 1 GiB; once again, the "many small files" problem of Hadoop remains.
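
To make the file-size point concrete, here is a rough sketch of planning a compaction pass that groups small files into batches of roughly 1 GiB. The file names, sizes, and the `plan_compaction` helper are all made up for illustration, and it assumes the merged output is about the sum of the input sizes, which will not hold exactly once the data is re-encoded.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def plan_compaction(file_sizes, target_bytes=GIB):
    """Greedily group files into batches whose total size stays at or under target_bytes.

    file_sizes maps a file name to its size in bytes. Files are taken largest
    first, and a new batch starts whenever the next file would overflow the
    current one. This is a simple heuristic, not an optimal bin packing.
    """
    batches, current, current_size = [], [], 0
    by_size_desc = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size_desc:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical small files sitting in one partition.
sizes = {
    "a.parquet": 600 * MIB,
    "b.parquet": 500 * MIB,
    "c.parquet": 300 * MIB,
    "d.parquet": 100 * MIB,
}
plan = plan_compaction(sizes)
print(plan)  # [['a.parquet'], ['b.parquet', 'c.parquet', 'd.parquet']]
```

Each batch would then be read and rewritten as a single file by whatever engine you use; that step is left out here because it depends on the engine.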

Then it's things like, can you organise your data in such a way to take advantage of RLE etc.?
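
A toy illustration of why row order matters for RLE: the same low-cardinality column collapses into far fewer runs once it is sorted. This is plain Python, not Parquet's actual encoder (which, as far as I know, applies a hybrid RLE/bit-packing scheme to dictionary indices), and the column values are invented.

```python
from itertools import groupby


def run_lengths(values):
    """Run-length encode a sequence into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical low-cardinality column in arrival order.
unsorted_col = ["US", "DE", "US", "FR", "DE", "US", "FR", "DE"]
# Same values after sorting the rows by this column before writing.
sorted_col = sorted(unsorted_col)

print(len(run_lengths(unsorted_col)))  # 8 runs: no two neighbours are equal
print(run_lengths(sorted_col))         # [('DE', 3), ('FR', 2), ('US', 3)]
```

In practice this usually means sorting or clustering rows on the low-cardinality columns before the write, so that long runs actually exist for the encoder to exploit.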

Either Spark or Redshift (serverless)