by mjdrogalis 1947 days ago
As someone who’s spent a lot of time working on data pipelines, I think this is a great breakdown of the complexity most data engineers are facing. However, I think there are two more keys to tidying up messy pipelines in practice:

1. You need to colocate both stream processing for the data pipeline and real-time materialized view serving for the results.

2. You need one paradigm for expressing both of these things.

Let me try to describe a bit why that is.

1. You virtually always need both stream processing and view serving in practice. In the real world, you ingest data streams from across the company and generally don’t have a say in how the data arrives. Before you can do the sort of materialization the author describes, you need to rearrange things a bit.

2. Building off of (1), if these two aren’t conceptually close, it becomes hard to make the whole system hang together. You still effectively have the same mess—it’s just spread over more components.
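
To make the two pieces concrete, here is a minimal Python sketch of what I mean by "stream processing" feeding a "materialized view". It is only an illustration of the idea, not ksqlDB code: the event shape (customer, amount) and the names rekey_orders, TotalsView, and run_pipeline are all made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator


def rekey_orders(raw_events: Iterable[dict]) -> Iterator[dict]:
    """Stream-processing step: skip malformed events and normalize the rest.

    The 'customer' and 'amount' fields are hypothetical; real upstream data
    rarely arrives in the shape the view needs.
    """
    for event in raw_events:
        if "customer" not in event or "amount" not in event:
            continue  # drop events we cannot interpret
        yield {
            "customer": str(event["customer"]).lower(),
            "amount": float(event["amount"]),
        }


class TotalsView:
    """Materialized view: a running total per customer, updated per event."""

    def __init__(self) -> None:
        self._totals: Dict[str, float] = defaultdict(float)

    def apply(self, event: dict) -> None:
        # Incremental update: no recomputation over the full history.
        self._totals[event["customer"]] += event["amount"]

    def lookup(self, customer: str) -> float:
        # Serving side: a cheap key lookup against the current state.
        return self._totals.get(customer, 0.0)


def run_pipeline(raw_events: Iterable[dict], view: TotalsView) -> TotalsView:
    """Wire the processing step into the view so both live in one program."""
    for event in rekey_orders(raw_events):
        view.apply(event)
    return view


if __name__ == "__main__":
    raw = [
        {"customer": "Alice", "amount": "10.5"},
        {"customer": "bob", "amount": 3},
        {"oops": 1},
        {"customer": "ALICE", "amount": 2},
    ]
    view = run_pipeline(raw, TotalsView())
    print(view.lookup("alice"), view.lookup("bob"))
```

The point of the sketch is that the cleanup step and the view are expressed in the same program and the same language. Once they live in separate systems with separate paradigms, the wiring between them becomes its own source of mess.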

This is something we’re working really hard on solving at Confluent. We build ksqlDB (https://ksqldb.io/), an event streaming database over Kafka that:

1. Lets you write programs that do stream processing and real-time materialized views in one place.

2. Lets you write all of it in SQL. I see a lot of people on this post longing for bash scripting, and I get it. These frameworks are way too complicated today. But to me, SQL is the ideal medium. It’s both concise and deeply expressive. Way more people are competent with SQL, too.

3. Has built-in support for connecting to external systems. One other, more mundane part of the puzzle is just integrating with other systems. ksqlDB leverages the Kafka Connect ecosystem to plug into 120+ data systems.

You can read more about how the materialization piece works in a recent blog post I wrote: https://www.confluent.io/blog/how-real-time-materialized-vie...