by wenc 908 days ago

Great article. I've worked with Parquet files on S3 for years but never quite understood what Iceberg was; the article explained it well. It's a database metadata format that sits over an underlying set of data and describes its schema, partitioning, etc.

Most people use the Hive partitioning convention (i.e. directory names like /key3=000/key2=002/), but Iceberg goes further than this by exposing even more structure to the query engine.
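To make the Hive convention concrete, here is a minimal sketch of how an engine could read partition values straight out of directory names and skip files that cannot match a filter. The function names (`parse_hive_partitions`, `prune`) and the `warehouse/events/...` paths are made up for illustration; real engines do this internally and with more care around typing and escaping.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path: str) -> dict[str, str]:
    """Extract key=value directory segments from a Hive-style path."""
    partitions = {}
    # Only directory segments carry partition values, so skip the file name.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


def prune(paths: list[str], **wanted: str) -> list[str]:
    """Keep only the files whose partition values match every requested key."""
    return [
        p for p in paths
        if all(parse_hive_partitions(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing, using the directory naming from the comment above.
files = [
    "warehouse/events/key3=000/key2=002/part-0.parquet",
    "warehouse/events/key3=000/key2=003/part-0.parquet",
    "warehouse/events/key3=001/key2=002/part-0.parquet",
]

print(parse_hive_partitions(files[0]))
print(prune(files, key3="000"))
```

Note that the only structure available here is whatever is encoded in the path, which is the limitation Iceberg's extra metadata is meant to address.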

In a traditional DBMS like Postgres, the schema, the query engine and the storage format come as a single package.

But with big data, we're building database components from scratch, and we can mix and match. We can use Iceberg as a metadata format, DuckDB as the query engine, Parquet as the storage format, and S3 as the storage medium.
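As a rough illustration of that separation, the sketch below keeps the "metadata" (a list of data files with per-file value ranges) apart from the "engine" (a function that decides which files to read). This is a toy model and not Iceberg's actual on-disk layout or API; `DataFile`, `plan_scan`, and the `s3://example-bucket/...` paths are all hypothetical. It only borrows the general idea that table metadata can record per-file column bounds for the engine to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str   # hypothetical object-store location of a Parquet file
    lower: int  # smallest value of the filter column in this file
    upper: int  # largest value of the filter column in this file


def plan_scan(manifest: list[DataFile], lo: int, hi: int) -> list[str]:
    """Return paths of files whose [lower, upper] range overlaps [lo, hi]."""
    return [f.path for f in manifest if f.upper >= lo and f.lower <= hi]


# The metadata layer: a plain list standing in for a table's file listing.
manifest = [
    DataFile("s3://example-bucket/t/a.parquet", 0, 99),
    DataFile("s3://example-bucket/t/b.parquet", 100, 199),
    DataFile("s3://example-bucket/t/c.parquet", 200, 299),
]

# The engine layer: only files that can contain values in 150..220 get read.
print(plan_scan(manifest, 150, 220))
```

Because `plan_scan` only depends on the manifest's shape, the file format and storage location behind each path could be swapped without touching it, which is the mix-and-match point in miniature.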

1 comment

3abiton 908 days ago

Very grateful for your recap. I skimmed through the article quickly, but got a better understanding from reading your comment!
