


	by indoordin0saur 744 days ago
	Doesn't it sound like they'll try to move the two formats closer together so that there isn't such a format war? IDK how it would benefit Databricks to ruin either format if they're now such huge stakeholders in both. Either way, I just want to know which format to pick. I've been chief data engineer at my current company for about a year and would like to move off plain Parquet files in my lake, but I'm not sure which table format to choose.


waterlx 744 days ago

Hi, in case you have not found an answer yet. In my humble opinion:
- Choose Iceberg if you have several compute/query engines besides Spark, such as Presto or Flink. Iceberg has a well-designed abstraction for an engine-independent table format, but its learning cost is relatively high.
- Choose Delta if you only use Spark and are happy to be tightly bound to Databricks.
- Choose Hudi if you want a data lake that works out of the box; it is quite easy to use.
- If your data is updated frequently, as with streaming, check https://paimon.apache.org/, provided you are happy to be tightly bound to Flink.


indoordin0saur 744 days ago

Thank you! Sounds like Iceberg is the best choice then. I'm very allergic to lock-in. Currently we're very Spark-heavy and our query engine is AWS Redshift Serverless. The recent AWS Glue Catalog support for Iceberg makes this look promising.
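For anyone wanting to try the Spark plus Glue Catalog route, here is a rough sketch of the Spark properties that usually wire an Iceberg catalog to Glue. The helper name `iceberg_glue_conf`, the catalog name `glue_catalog`, and the bucket path are placeholders made up for illustration; the class names are the ones shipped in the Iceberg Spark and AWS modules, but check them against the Iceberg version you actually deploy.

```python
from typing import Dict


def iceberg_glue_conf(catalog_name: str, warehouse: str) -> Dict[str, str]:
    """Build Spark properties for an Iceberg catalog backed by AWS Glue.

    catalog_name and warehouse are caller-supplied placeholders, not
    values taken from this thread.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg SQL extensions (MERGE INTO, CALL procedures, etc.).
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        # Registers a Spark catalog implemented by Iceberg.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Stores table metadata pointers in the AWS Glue Data Catalog.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Reads and writes data files through Iceberg's S3 file IO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Root S3 location for new tables (hypothetical path supplied by caller).
        f"{prefix}.warehouse": warehouse,
    }


if __name__ == "__main__":
    conf = iceberg_glue_conf("glue_catalog", "s3://my-bucket/warehouse")
    for key, value in sorted(conf.items()):
        print(f"{key}={value}")
```

Each pair can be passed to `SparkSession.builder.config(key, value)` or as `--conf key=value` on `spark-submit`. This assumes the Iceberg Spark runtime and AWS dependencies are already on the classpath.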


ruipds 741 days ago

I heard from an AWS employee that they consider Iceberg to be the future. A lot of their services will be glued together with it.
