| On the data warehousing side, I think the story looks like this: 1) Cloud data warehouses like Redshift, Snowflake, and BigQuery proved to be quite good at handling very large datasets (petabytes) with very fast querying. 2) Customers of these proprietary solutions didn't want to be locked in. So many are drifting toward Iceberg tables on top of Parquet (columnar) data files. Another "hidden" motive here is that Cloud object stores give you regional (multi-zonal) redundancy without having to pay extra inter-zonal fees. An OLTP database would likely have to pay this cost, as it likely won't be based purely on object stores - it'll need a fast durable medium (disk), if at least for the WAL or the hot pages. So here we see the topology of Cloud object stores being another reason forcing the split between OLTP and OLAP. But how does this new world of open OLTP/OLAP technologies look like? Pretty complicated. 1) You'd probably run PostGres as your OLTP DB, as it's the default these days and scales quite well. 2) You'd set up an Iceberg/Parquet system for OLAP, probably on Cloud object stores. 3) Now you need to stream the changes from PostGres to Iceberg/Parquet. The canonical OSS way to do this is to set up a Kafka cluster with Kafka Connect. You use the Debezium CDC connector for Postgres to pull deltas, then write to Iceberg/Parquet using the Iceberg sink connector. This incurs extra compute, memory, network, and disk. There's so many moving parts here. The ideal is likely a direct Postgres->Iceberg write flow built-into PostGres. The pg_mooncake this company is offering also adds DuckDB-based querying, but that's likely not necessary if you plan to use Iceberg-compatible querying engines anyway. Ideally, you have one plugin for purely streaming PostGres writes to Iceberg with some defined lag. That would cut out the third bullet above. |
Yep. At the scope of a single table, append-only history is nice but you're often after a clone of your source table within Iceberg, materialized from insert/update/delete events with bounded latency.
There are also nuances like Postgres REPLICA IDENTITY and TOAST columns. Enabling REPLICA IDENTITY FULL amplifies you source DB WAL volume, but not having it means your CDC updates will clobber your unchanged TOAST values.
If you're moving multiple tables, ideally your multi-table source transactions map into corresponding Iceberg transactions.
Zooming out, there's the orchestration concern of propagating changes to table schema over time, or handling tables that come and go at the source DB, or adding new data sources, or handling sources without trivially mapped schema (legacy lakes / NoSQL / SaaS).
As an on-topic plug, my company tackles this problem. Postgres => Iceberg is a common use case.
[0] https://docs.estuary.dev/reference/Connectors/materializatio...