Hacker News new | ask | show | jobs
by georgewfraser 1944 days ago
Great article, one quibble: there isn’t really a clear dividing line between batch and streaming. If you process data one row at a time, that is clearly a streaming pipeline, but most systems that call themselves streaming actually process data in small batches. From a user perspective, it’s an implementational detail, the only thing you care about is the latency target.

Nearly all data sources, including the changelogs of databases, are polling-based APIs, so you’re getting data from the source in (small) batches. If your goal is to put this data into a data warehouse like Snowflake, or a system like Materialize, the lowest latency thing you can do is just immediately put that data into the destination. I sometimes see people put a message broker like Kafka in the middle of this process, thinking it’s going to imbue the system with some quality of streamyness, but this can only add latency. People are often surprised that we don’t use a message broker at Fivetran, but when you stop and think about it there’s just no benefit in this context.

6 comments

> If you process data one row at a time, that is clearly a streaming pipeline, but most systems that call themselves streaming actually process data in small batches. From a user perspective, it’s an implementational detail, the only thing you care about is the latency target.

Author here. 100% agreed.

As an aside, I just came across your post about how Databricks is an RDBMS [0]. I recently wrote a similar article from a slightly more abstract perspective [1].

Having worked heavily with RDBMSs in the first part of my career, I feel like so many of the concepts and patterns I learned about there are being re-expressed today with modern, distributed data tooling. And that was part of my inspiration for this post about data pipelines.

[0] https://fivetran.com/blog/databricks-is-an-rdbms

[1] https://nchammas.com/writing/modern-data-lake-database

https://www.confluent.io/blog/turning-the-database-inside-ou... might be interesting if you haven't seen it already. The way I see it RDBMSes already had most of the tech, but they wrap it up in an opaque black box that can only be used one way, like those automagical frameworks that generate a full webapp from a couple of class definitions but then fall apart as soon as you want to do something slightly specialised. So the data revolution is really about unbundling all the cool RDBMS tech and moving from a shrinkwrapped database product to something more like a library that gives you full control over what happens when, letting you integrate your business logic wherever it makes sense.
That talk by Martin Kleppmann is fantastic.

Another, closely related article is "The Log: What every software engineer should know about real-time data's unifying abstraction" by Jay Kreps.

https://engineering.linkedin.com/distributed-systems/log-wha...

> I feel like so many of the concepts and patterns I learned about there are being re-expressed today with modern, distributed data tooling

So much this.

I work on Flow [0] at Estuary [1]; you might be interested in what we're doing. It offers millisecond latency continuous datasets that are _also_ plain JSON files in cloud storage -- a real-time data lake -- which can be declaratively transformed and materialized into other systems (RDBMS, key/value, even pub/sub) using precise, continuous updates. It's the materialized view model you articulate, and Flow is in essence composing data lake-ing with continuous map/combine/reduce to make it happen.

I was asked the other day if Flow "is a database" by someone who only wanted a 2-3 sentence answer, and I waffled badly. The very nature of "databases" are so fuzzy today. They're increasingly unboxed across the Cambrian explosion of storage, query, and compute options now available to us. S3 and friends for primary storage; on-demand MPP compute for query and transformation; wide varieties of RDBMSs, key/value stores, OLAP systems, even pub/sub as specialized indexes for materializations. Flow's goal, in this worldview, is to be a hypervisor and orchestrator for this evolving cloud database. Not a punchy elevator pitch, but there it is.

[0] https://estuary.readthedocs.io/en/latest/README.html [1] https://estuary.dev/

Strong concur, the idea of "database" just doesn't make sense anymore, I've (d?)evolved to "storage, streaming, and compute" (as it seems you do as well) in all my discussions.

Have you read ThoughtWorks material on the "data mesh"? Sounds like your product is looking to be a part of that kind of new data ecology.

https://martinfowler.com/articles/data-mesh-principles.html

It is in definitely anticlimactic after data warehousing, data lakes, data lakehouses, etc. to just throw up your hands and say "data whenever you want it, wherever you want it, in whatever form you want it" (at whatever price you're willing to pay!) So I feel your pain on marketing your product,but I think the next 5 years or so will be heavily focused on automating data quality and standardized pipelines, computational governance and optimizing workloads, and intelligent "just in time" materializaton, caching, HTAP, etc.

Your big play in my mind is helping customers optimize the (literal financial) cost/benefit tradeoff of all those compute and query engines.

Can somebody explain Data Mesh in simple terms? I've watched numerous videos and read the article, but still can't understand what it's supposed to say.

Is it basically to set up one data lake per analytics team, instead of a big centralized data lake?

It seems to me this is already being done from my experience. Each team has their own s3 bucket in which they store their data products and other teams can consume them using their favorite data engines as long as they can read parquet files from s3.

Or am I missing something?

It's basically a set of 4 principals or best practices: 1. decentralized data ownership. 2. Data as a product. 3. Self-serve data infrastructure. 4. Federated governance. So I don't think of it as some new product thing but rather principals that work well for people with a certain type of use case. It's a bit like Microservices in that regard.
Interesting project! (The interactive slides are cool btw.)

Could you share a bit about how engineers express data transformations in Flow?

From a quick look at the docs, it doesn't look like you are using SQL to do this, which is interesting since it bucks the general trend in data tooling.

You write pure-function mappers using any imperative language. We're targeting TypeScript initially for $reasons, followed by ergonomic OpenAPI / WASM support, but anything that can support JSON => JSON over HTTP will do.

It's only on you to write mapper functions -- combine & reduce happens automatically, using reduction annotations of the collection's JSON-schema. Think of it as a SQL group-by where aggregate functions have been hoisted to the schema itself.

Here's a worked-up narrative example using Citi Bike system data: https://estuary.readthedocs.io/en/latest/examples/citi-bike/...

| it bucks the general trend in data tooling

You can say that again. It's a bet, and I don't know how it will work out. I do think SQL is a poor fit for expressing long-lived workflows that evolve over joined datasets / schemas / transformations, or which are more operational vs analytical in nature, or which want to bring in existing code. Though I _completely_ understand why others have focused on SQL first, and it's definitely not an either / or proposition.

Ultimatelly there will never be a pure streaming system processing, one record at a time, in real world. Any such system contains a busy loop somewhere inside polling the source, say each 100ms, and unless it shares a lock with the source system, it will never guarantee that there won't be more items in the source queue within those 100ms intervals. Therefore all such systems are at best (micro)batch systems. Also streaming systems literally batch data into time windows when doing, eg. group by operation, so they turn into batch systems then.

Pure batch systems are those where the processing window is infinite and no state is preserved. Everything is recomputed from scratch on every run. This seems to be the prefered way to do ETL because dragging state around and accidentally polluting it is better to be avoided if not handled properly.

What is more useful for real world data processing would be an "incremental batch" model, in which the processing system has a memory of what it has processed so far and after comparing that against source data, it would determine what will run in the next update batch.

Sadly, the industry is plagued with either pure streaming solutions, even though most data problems are not of this nature. Or ETL and workflow systems, which are thinking in terms of pure batching model. This results in me having to implement the necessary logic for incremental loads myself while not finding these ETL frameworks very useful.

I've honestly had more luck writing scripts myself than relying on excessively complicated frameworks for ETLing out there. They only seem to convolute stuff together like Ruby on Rails back in the days, instead of separating concerns like some small http library or web microframework.

Is there anything out there on the horizon which focuses on incremental batch processing, or as the article point out, updating materialized views that I manage myself?

How would this look like? Specifically, how do you know if something has been deleted? Do you compare the primary keys in your materialized view (the last snapshot you have of the data) with the source data to know what changed? Isn't that really hard to do if they're not in the same database?

In real life most people prefer taking a full snapshot each day because they don't have good solutions to these problems in batch systems (CDC is another story).

Source data should not experience deletes or updates, otherwise backfills will not work. Deletes can be handled by mirroring source data. Updates are difficult and will need a full CDC system to capture them. Better is to negotiate with data provider to send data updates as appends and never to delete from history.

The whole point of ETL is to bring data from one database to another. The comparison of source and destination primary keys can be done in python outside of db. And should be done on entire partitions instead of individual rows. Eg. you only consider which 'day' partitions have been loaded, not which rows have been loaded.

That kind of approach is fine for special cases like time series or logs or events, but "no updates or deletes" is never going to be true for arbitrary data.

"Negotiating with data provider" is never going to happen - SAP or IBM or whatever vendor of whatever you're integrating is not going to change how their systems work to make your life easier - more likely they would use it as an opportunity to pitch their reporting solution instead.

Meaning from simple data movement, you get need for CDC on source end, then the simple incremental movement, then deduplication on target end - and that one is pretty computationally expensive.

For small data and low refresh frequencies (like singular gigabytes in source size, so hundreds of megabytes in columnar, updated daily), this dance might not be worth it compared to daily full snapshots.

I wish you were right though, my life would be hella easier.

We are probably refering to different scenarios. When purchasing data for analytics, data providers are usually sophisticated enough to know not to modify their data history. With new ones, data delivery format can be negotiated.

Data providers usually wait for a day or something worth of data to collect before validating and releasing it to customers.

For integrating some OLTP database updating in real time on the other hand, yes you will need CDC.

---

Most of data engineering is just incrementally adding new data to existing corpus and then running a big batch job to dedup, sort or partition. This last step surely is computationally expensive, but at least it is conceptually simple and can be solved by throwing hardware at it. The first part of incremental updates is what imo causes more troubles.

I think something like singer.io can be "microframework of ETL" or rather ELT which is a better idea anyway. But of course, it has its own challenges.
> there’s just no benefit in this context

I'd beg to differ; having a message broker as part of the streaming pipeline allows to set up multiple consumers (e.g. Snowflake and Elasticsearch and something else). Also, depending on the settings, you can replay streams from the beginning.

That's why we see >90% of Debezium users running with Kafka or alternatives. There is a group of point-to-point users (e.g. caching use cases) mostly via Debezium engine, but it's a small minority.

I'd say that the difference was the systems behavior in the presence of IO, and it's pretty important in my experience. Micro-batching systems hold up processing while waiting for IO but proper streaming implementations continue using cpu for elements at other points in the stream, very roughly.
Exactly. You can have the amortization benefits of batching without adding extra latency, if you do optimistic pipelining. It's super powerful if your transaction sizes aren't linear, e.g. because you're doing in-memory aggregations.

Gazette consumers use this strategy: https://gazette.readthedocs.io/en/latest/consumers-concepts....

Do you have any links you would recommend reading for designing real-time data/reporting solutions?
It is amusing how often we think adding work to the system will speed it up.

It is interesting when is works. :)