by Shyaamal11 139 days ago

One thing I’ve seen with this pattern is that Postgres + CDC works really well as an early-stage streaming backbone, especially when the operational DB is already the source of truth.

Using WAL → CDC → downstream systems keeps the architecture simple at first, and tools like Debezium make it relatively straightforward to pipe those changes into Kafka or other processors.
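For anyone who hasn't wired this up before, here is a rough sketch of what registering a Debezium Postgres connector with Kafka Connect tends to look like. The helper name, connector name, host, credentials, and table list are all made up for illustration. The property keys are the ones I'd expect from Debezium 2.x (older releases used `database.server.name` instead of `topic.prefix`), so check them against the version you actually run.

```python
import json


def build_debezium_postgres_connector(name, host, dbname, user, password,
                                      tables, topic_prefix,
                                      port=5432, slot_name="debezium_slot"):
    """Build the JSON body for POST /connectors on a Kafka Connect worker.

    Hypothetical helper: every argument value is supplied by the caller,
    nothing here is taken from a real deployment.
    """
    config = {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        # pgoutput is the logical decoding plugin built into Postgres 10+.
        "plugin.name": "pgoutput",
        "database.hostname": host,
        "database.port": str(port),
        "database.user": user,
        "database.password": password,
        "database.dbname": dbname,
        # Prefix for the Kafka topics, e.g. <prefix>.<schema>.<table>.
        "topic.prefix": topic_prefix,
        # Replication slot that holds the connector's position in the WAL.
        "slot.name": slot_name,
        "table.include.list": ",".join(tables),
    }
    return {"name": name, "config": config}


if __name__ == "__main__":
    payload = build_debezium_postgres_connector(
        name="orders-cdc",            # assumed connector name
        host="postgres.internal",     # assumed hostname
        dbname="app",
        user="cdc_user",
        password="change-me",
        tables=["public.orders", "public.customers"],
        topic_prefix="app",
    )
    print(json.dumps(payload, indent=2))
```

The printed JSON is what you would POST to the Connect worker's `/connectors` endpoint. The database side still needs `wal_level=logical` and a role with replication privileges.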

Where things start getting interesting is the analytics side. Once the CDC stream lands in something like Iceberg tables, you effectively get a continuously updated analytical dataset that can be queried with engines like Spark or Trino.
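To make "continuously updated" concrete, here is a minimal sketch of the merge step, assuming Debezium-style change events with `op`, `before`, and `after` fields and a primary key column called `id` (the key name and the sample rows are my assumptions). A plain dict stands in for the table. With Iceberg the same logic would more likely be a periodic `MERGE INTO` from Spark or a streaming sink.

```python
def apply_change_event(table, event, key="id"):
    """Apply one Debezium-style change event to an in-memory table.

    `table` maps primary key -> row dict. Only the c/r/u/d op codes are
    handled; anything else (e.g. truncate) raises so it is not silently lost.
    """
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, and update all carry the new row in `after`.
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # deletes carry the old row (at least its key) in `before`.
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unhandled op code: {op!r}")
    return table


def replay(events, key="id"):
    """Fold a sequence of change events, in order, into a fresh table."""
    table = {}
    for event in events:
        apply_change_event(table, event, key=key)
    return table


if __name__ == "__main__":
    events = [
        {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
        {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
        {"op": "u", "before": {"id": 1, "status": "new"},
         "after": {"id": 1, "status": "paid"}},
        {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
    ]
    print(replay(events))
```

Order matters here: events for the same key have to be applied in commit order, which is why keyed partitioning on the Kafka side is the usual setup.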

At that point the architecture starts to look less like a traditional “data warehouse pipeline” and more like a streaming-first lakehouse where operational data flows directly into analytical storage.

The main challenge I’ve seen is operational complexity once you start combining:

- CDC ingestion
- stream processing
- lakehouse storage (Iceberg/Delta)
- distributed query engines

That’s where platforms trying to package the open stack together (e.g. Spark + Iceberg + Trino) become interesting. Some newer platforms like IOMETE are basically trying to simplify running that type of lakehouse stack on Kubernetes so teams don’t have to glue everything together manually.

Curious where people think the breakpoint is: at what scale does Postgres+CDC stop being “good enough”, so that you need a dedicated log system as the primary event backbone?