Hacker News new | ask | show | jobs
by garganzol 217 days ago
DuckLake format has an unresolved built-in chicken and egg conflict: it requires SQL database to represent its catalog. But this is what some people are running away from when they choose Parquet format in the first place. Parquet = easy, SQL = hard, adding SQL to Parquet makes the resulting format hard. I would expect a catalog to be in Parquet format as well, then it becomes something self-bootstrapping and usable.
2 comments

DuckLake is more comparable to Iceberg and Delta than to raw parquet files. Iceberg requires a catalog layer too, a file system based one at its simplest. For DuckLake any RDBMS will do, including fs-based ones like DuckDB and SQLite. The difference is that DuckLake will use that database with all its ACID goodness for all metadata operations and there is no need to implement transactional semantics over a REST or object storage API.
It is not a chicken and egg problem, it is just a requirement to have an RDBMS available for systems like DuckLake and Hive to store their catalogs in. Metadata is relatively small and needs to provide ACID r/w => great RDBMS use case.
What about file-based catalogs with Iceberg? Found one that puts it in a single json file: https://github.com/boringdata/boring-catalog
Then concurrency suffers since you have to have locks when you update files.

That's also why ducklake performs better than others.

For many use cases this trade-off is worth it.