| HN Mirror



	by jasonjmcghee 475 days ago
	Have you written about your parquet strategy anywhere? Or have suggested reading related to the tuning you've done? Super interested.

indoordin0saur 475 days ago

Also very interested in the parquet tuning. I have been building my data lake, and most of the optimization I do is just efficient partitioning.

hendiatris 475 days ago

I will write something up when the dust settles; I’m still testing things out. It’s a project where the data is fairly standardized but there is about a petabyte to deal with, so I think it makes sense to invest in efficiency at the lower level rather than throw tons of resources at it. That has meant a custom parser for the input data written in Rust, lots of analysis of the statistics of the data, etc. It has been a different approach to data engineering and one that I hope we see more of.

Regarding reading materials, I found this DuckDB post to be especially helpful in realizing how parquet could be better leveraged for efficiency: https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...

EdwardDiego 475 days ago

What query engine are you using?

The optimal file size for Parquet tends to be about 1 GiB; once again, the "many small files" problem from Hadoop remains.
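
To make the file-size point concrete, here is a rough sketch of planning a compaction: greedily grouping small files into batches of roughly 1 GiB each. The file names and sizes are made up for illustration, and the actual rewrite step (not shown) depends on whichever engine you use.

```python
TARGET_BYTES = 1 << 30  # ~1 GiB, the rule-of-thumb target mentioned above


def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Greedily group (name, size) pairs into batches whose total size
    stays at or under `target`. A single file larger than `target`
    ends up in a batch of its own."""
    batches, current, current_size = [], [], 0
    # Largest files first, so big files are placed before the small ones.
    for name, size in sorted(file_sizes, key=lambda item: item[1], reverse=True):
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical input: 40 small files of 64 MiB each.
small_files = [(f"part-{i:04d}.parquet", 64 * 1024 * 1024) for i in range(40)]
batches = plan_compaction(small_files)
print([len(batch) for batch in batches])
```

Each batch would then be rewritten as a single output file. This is a simple greedy pass, not an optimal bin packing, which is usually good enough when the inputs are many similarly sized small files.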

Then it's things like: can you organise your data in such a way as to take advantage of RLE (run-length encoding), etc.?
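
As a toy illustration of why ordering matters for RLE, the sketch below run-length encodes a low-cardinality column before and after sorting. It is plain Python, not Parquet's actual encoder, and the column values are invented; the point is only that sorted data collapses into far fewer runs.

```python
import random
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical low-cardinality column: 300 rows for each of three values,
# shuffled to mimic data arriving in arbitrary order.
column = ["DE", "FR", "US"] * 300
random.Random(0).shuffle(column)

unsorted_runs = run_length_encode(column)
sorted_runs = run_length_encode(sorted(column))

print(len(unsorted_runs))  # many short runs
print(sorted_runs)         # one run per distinct value
```

In a real table you would sort (or cluster) rows by the low-cardinality columns before writing, so that each row group sees long runs of identical values.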

indoordin0saur 469 days ago

Either Spark or Redshift (serverless)
