|
Great article, one quibble: there isn’t really a clear dividing line between batch and streaming. If you process data one row at a time, that is clearly a streaming pipeline, but most systems that call themselves streaming actually process data in small batches. From a user perspective, it’s an implementational detail, the only thing you care about is the latency target. Nearly all data sources, including the changelogs of databases, are polling-based APIs, so you’re getting data from the source in (small) batches. If your goal is to put this data into a data warehouse like Snowflake, or a system like Materialize, the lowest latency thing you can do is just immediately put that data into the destination. I sometimes see people put a message broker like Kafka in the middle of this process, thinking it’s going to imbue the system with some quality of streamyness, but this can only add latency. People are often surprised that we don’t use a message broker at Fivetran, but when you stop and think about it there’s just no benefit in this context. |
Author here. 100% agreed.
As an aside, I just came across your post about how Databricks is an RDBMS [0]. I recently wrote a similar article from a slightly more abstract perspective [1].
Having worked heavily with RDBMSs in the first part of my career, I feel like so many of the concepts and patterns I learned about there are being re-expressed today with modern, distributed data tooling. And that was part of my inspiration for this post about data pipelines.
[0] https://fivetran.com/blog/databricks-is-an-rdbms
[1] https://nchammas.com/writing/modern-data-lake-database