Hacker News new | ask | show | jobs
by snidane 1944 days ago
Ultimatelly there will never be a pure streaming system processing, one record at a time, in real world. Any such system contains a busy loop somewhere inside polling the source, say each 100ms, and unless it shares a lock with the source system, it will never guarantee that there won't be more items in the source queue within those 100ms intervals. Therefore all such systems are at best (micro)batch systems. Also streaming systems literally batch data into time windows when doing, eg. group by operation, so they turn into batch systems then.

Pure batch systems are those where the processing window is infinite and no state is preserved. Everything is recomputed from scratch on every run. This seems to be the prefered way to do ETL because dragging state around and accidentally polluting it is better to be avoided if not handled properly.

What is more useful for real world data processing would be an "incremental batch" model, in which the processing system has a memory of what it has processed so far and after comparing that against source data, it would determine what will run in the next update batch.

Sadly, the industry is plagued with either pure streaming solutions, even though most data problems are not of this nature. Or ETL and workflow systems, which are thinking in terms of pure batching model. This results in me having to implement the necessary logic for incremental loads myself while not finding these ETL frameworks very useful.

I've honestly had more luck writing scripts myself than relying on excessively complicated frameworks for ETLing out there. They only seem to convolute stuff together like Ruby on Rails back in the days, instead of separating concerns like some small http library or web microframework.

Is there anything out there on the horizon which focuses on incremental batch processing, or as the article point out, updating materialized views that I manage myself?

2 comments

How would this look like? Specifically, how do you know if something has been deleted? Do you compare the primary keys in your materialized view (the last snapshot you have of the data) with the source data to know what changed? Isn't that really hard to do if they're not in the same database?

In real life most people prefer taking a full snapshot each day because they don't have good solutions to these problems in batch systems (CDC is another story).

Source data should not experience deletes or updates, otherwise backfills will not work. Deletes can be handled by mirroring source data. Updates are difficult and will need a full CDC system to capture them. Better is to negotiate with data provider to send data updates as appends and never to delete from history.

The whole point of ETL is to bring data from one database to another. The comparison of source and destination primary keys can be done in python outside of db. And should be done on entire partitions instead of individual rows. Eg. you only consider which 'day' partitions have been loaded, not which rows have been loaded.

That kind of approach is fine for special cases like time series or logs or events, but "no updates or deletes" is never going to be true for arbitrary data.

"Negotiating with data provider" is never going to happen - SAP or IBM or whatever vendor of whatever you're integrating is not going to change how their systems work to make your life easier - more likely they would use it as an opportunity to pitch their reporting solution instead.

Meaning from simple data movement, you get need for CDC on source end, then the simple incremental movement, then deduplication on target end - and that one is pretty computationally expensive.

For small data and low refresh frequencies (like singular gigabytes in source size, so hundreds of megabytes in columnar, updated daily), this dance might not be worth it compared to daily full snapshots.

I wish you were right though, my life would be hella easier.

We are probably refering to different scenarios. When purchasing data for analytics, data providers are usually sophisticated enough to know not to modify their data history. With new ones, data delivery format can be negotiated.

Data providers usually wait for a day or something worth of data to collect before validating and releasing it to customers.

For integrating some OLTP database updating in real time on the other hand, yes you will need CDC.

---

Most of data engineering is just incrementally adding new data to existing corpus and then running a big batch job to dedup, sort or partition. This last step surely is computationally expensive, but at least it is conceptually simple and can be solved by throwing hardware at it. The first part of incremental updates is what imo causes more troubles.

I think something like singer.io can be "microframework of ETL" or rather ELT which is a better idea anyway. But of course, it has its own challenges.