|
Ultimately there will never be a pure streaming system, processing one record at a time, in the real world. Any such system contains a busy loop somewhere inside polling the source, say every 100 ms, and unless it shares a lock with the source system, it can never guarantee that no more than one item arrives in the source queue within each 100 ms interval. Therefore all such systems are at best (micro)batch systems. Streaming systems also literally batch data into time windows when doing, e.g., a group-by operation, so they turn into batch systems at that point. Pure batch systems are those where the processing window is infinite and no state is preserved: everything is recomputed from scratch on every run. This seems to be the preferred way to do ETL, because dragging state around and accidentally polluting it is best avoided unless it is handled properly. What would be more useful for real-world data processing is an "incremental batch" model, in which the processing system remembers what it has processed so far and, after comparing that against the source data, determines what to run in the next update batch. Sadly, the industry is plagued with either pure streaming solutions, even though most data problems are not of this nature, or ETL and workflow systems that think in terms of a pure batch model. As a result I have to implement the logic for incremental loads myself, and I don't find these ETL frameworks very useful. I've honestly had more luck writing scripts myself than relying on the excessively complicated ETL frameworks out there. They only seem to tangle concerns together, like Ruby on Rails back in the day, instead of separating them like a small HTTP library or web microframework does. Is there anything on the horizon that focuses on incremental batch processing or, as the article points out, on updating materialized views that I manage myself? |
In real life most people prefer taking a full snapshot each day because they don't have good solutions to these problems in batch systems (CDC is another story).
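For what it's worth, the "incremental batch" idea from the parent comment can be sketched in a few lines. This is only an illustration under assumptions, not how any particular framework does it: the names `plan_increment` and `run_incremental_batch` are made up, the source is assumed to be a plain dict keyed by string record IDs, and the "memory" is assumed to be a JSON file of already-processed keys.

```python
import json
from pathlib import Path


def plan_increment(source_keys, processed_keys):
    """Compare the source against the remembered state and return
    the keys that still need processing, in sorted order."""
    return sorted(set(source_keys) - set(processed_keys))


def run_incremental_batch(source, state_path, process):
    """Process only the records that earlier runs have not seen.

    source:     dict mapping record key (str) -> record (hypothetical source)
    state_path: path of a JSON file holding the processed keys (assumption)
    process:    callable applied to each new record
    """
    state_path = Path(state_path)
    if state_path.exists():
        processed = set(json.loads(state_path.read_text()))
    else:
        processed = set()  # first run: nothing processed yet

    todo = plan_increment(source.keys(), processed)
    results = [process(source[key]) for key in todo]

    # Persist the state only after the whole batch succeeded, so a run
    # that raises part-way through is retried in full next time.
    processed.update(todo)
    state_path.write_text(json.dumps(sorted(processed)))
    return todo, results
```

Calling it twice on the same source does the work once and then nothing; adding a record makes the next run pick up just that record. Note that this sketch only detects new keys. Updated or deleted records would need something more, such as comparing content hashes or timestamps, which is likely part of why the daily full snapshot is the common fallback.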