|
|
|
|
|
by wenc
908 days ago
|
|
Great article. I've worked with Parquet files on S3 for years, but I didn't quite understand what Iceberg was, but the article explained it well. It's a database metadata format for an underlying set of data which describes its schema, partitioning etc. Most people use Hive partitioning convention (i.e. directory names like /key3=000/key2=002/) but Iceberg goes farther than this by exposing even more structure to the query engine. In a traditional DBMS like Postgres, the schema, the query engine and the storage format come as a single package. But with big data, we're building database components from scratch, and we can mix and match. We can use Iceberg as a metadata format, DuckDB as the query engine, Parquet as the storage format, and S3 as the storage medium. |
|