by snidane 1944 days ago

Most of today's data engineering problems would be solved by a tool in which I define an arbitrary transformation of, say, a single daily data increment, and the system handles the state management and the loading of all the increments, regardless of whether they come from source updates or backfills.

Data engineering really is just the maintenance of incrementally updated materialized views, but no tool out there recognizes this yet. At best, they help you orchestrate and parallelize your ETLs across multiple threads and machines. They become glorified makefiles, at the cost of introducing several layers of infrastructure (e.g. Airflow) for what should have been solved by simple bash scripting.

Yet at best these tools only help with stateless batch processing. When it comes to stateful processing, which is necessary for maintaining incrementally updated materialized views and idempotent loads, I have to couple the logic of view state management (what has been loaded so far) with the logic of the actual data transformation.
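To make the decoupling concrete, here is a minimal sketch in which the transformation is a pure function of one increment and a separate runner owns the "what has been loaded so far" state. Everything in it (`IncrementalView`, `sync`, the `(version, rows)` shape of the source) is hypothetical and invented for illustration; it is not the API of any existing tool.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[dict]


def events_per_user(rows: Rows) -> Rows:
    """Pure transformation of a single increment; knows nothing about load state."""
    counts: Dict[str, int] = {}
    for row in rows:
        counts[row["user"]] = counts.get(row["user"], 0) + 1
    return [{"user": user, "events": n} for user, n in sorted(counts.items())]


class IncrementalView:
    """Hypothetical runner that owns the load state so the transform does not have to."""

    def __init__(self, transform: Callable[[Rows], Rows]) -> None:
        self.transform = transform
        self.partitions: Dict[str, Rows] = {}  # view data, keyed by increment
        self.loaded: Dict[str, str] = {}       # ledger: increment -> source version

    def sync(self, source: Dict[str, Tuple[str, Rows]]) -> List[str]:
        """Load every increment that is new or whose source version changed."""
        refreshed = []
        for key in sorted(source):
            version, rows = source[key]
            if self.loaded.get(key) == version:
                continue  # already loaded at this version, nothing to do
            # Overwriting the whole partition makes a repeated load idempotent.
            self.partitions[key] = self.transform(rows)
            self.loaded[key] = version
            refreshed.append(key)
        return refreshed


# Assumed source layout: increment key -> (version tag, raw rows).
source = {
    "2020-01-01": ("v1", [{"user": "a"}, {"user": "a"}, {"user": "b"}]),
    "2020-01-02": ("v1", []),
}
view = IncrementalView(events_per_user)
first = view.sync(source)    # initial load of both days
second = view.sync(source)   # nothing changed, nothing reloaded
source["2020-01-01"] = ("v2", [{"user": "b"}])  # a source update / backfill
third = view.sync(source)    # only the changed day is reloaded
```

The same `sync` call covers the initial load, a routine daily run and a backfill; the transform never has to ask what was loaded before.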

The industry's usual response to the difficulties of batch ETL is: batch data processing systems are resource hungry and slow, all you need from now on is streaming.

No, actually I don't. For data analytics, pure streaming has almost no application. Data analytics is essentially compression of big data into something smaller, i.e. some form of group by. I have to wait for a window of data to close before computing anything useful. Analytics on truly "real time" data over unclosed windows is confusing and useless.
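As a small illustration of "wait for the window to close", the sketch below groups event timestamps by hour and reports only the hours that have already ended. The function name and the assumption that no events arrive late are mine, not part of the original comment.

```python
from collections import Counter
from datetime import datetime


def closed_hourly_counts(events, now):
    """Count events per hour, skipping the hour that is still open at `now`."""
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    counts = Counter()
    for ts in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        if hour < current_hour:  # only windows that have fully closed
            counts[hour] += 1
    return dict(counts)


now = datetime(2020, 1, 1, 10, 30)
events = [
    datetime(2020, 1, 1, 8, 5),
    datetime(2020, 1, 1, 8, 50),
    datetime(2020, 1, 1, 9, 15),
    datetime(2020, 1, 1, 10, 10),  # falls in the open 10:00 window, not reported
]
result = closed_hourly_counts(events, now)
```

Here `result` contains the 08:00 and 09:00 windows only; the 10:00 window shows up once the hour is over.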

So all data analytics will always run on groups, windows and batches of data. Therefore I need a system that helps me run data transformations on batches or, more precisely, on a stream of smaller batches. I need it to react to incoming daily, hourly or minutely batches, and I need it to backfill my materialized view in case I decide to wipe it and start again.
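One way to get both behaviours, sketched here with Python's built-in `sqlite3`, is to make each batch load replace its own slice of the view inside a transaction. Re-running a batch is then harmless, and a backfill is just "wipe and replay". The table and function names (`daily_totals`, `load_day`, `rebuild`) are made up for the example.

```python
import sqlite3


def load_day(conn, day, rows):
    """Replace one day's slice of the view atomically; safe to re-run."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM daily_totals WHERE day = ?", (day,))
        conn.execute(
            "INSERT INTO daily_totals (day, total) VALUES (?, ?)",
            (day, sum(rows)),
        )


def rebuild(conn, batches):
    """Backfill: wipe the view and replay every batch in order."""
    with conn:
        conn.execute("DELETE FROM daily_totals")
    for day in sorted(batches):
        load_day(conn, day, batches[day])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total INTEGER)")
batches = {"2020-01-01": [1, 2, 3], "2020-01-02": [10]}

# Reacting to an incoming batch, accidentally twice: the result is the same.
load_day(conn, "2020-01-01", batches["2020-01-01"])
load_day(conn, "2020-01-01", batches["2020-01-01"])
after_reload = conn.execute("SELECT day, total FROM daily_totals").fetchall()

# Wiping the view and starting again.
rebuild(conn, batches)
after_rebuild = conn.execute(
    "SELECT day, total FROM daily_totals ORDER BY day"
).fetchall()
```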

You can literally do this in what was supposed to be the original system for orchestrating a bunch of programs: shell scripting. And you'll be happier for it than with the current complex frameworks. The only things you will miss are a way to run distributed cron and a way to distribute load across multiple machines. At least the latter can be handled by GNU parallel.

This article hits the nail on the head in describing what the conceptual model for ETL actually is, and once others follow, we might finally see new frameworks, or just libraries, that greatly simplify ETLs. Perhaps one day data engineering will be as simple as running an idempotent bash, Python or SQL script, or even close to nonexistent.

2 comments

endymi0n 1944 days ago

https://www.getdbt.com/ comes extremely close in my eyes and even tackles the documentation and infra-as-code aspect. We went all in half a year ago and never looked back.

snidane 1944 days ago

DBT is interesting, but is far from what I'm describing.

1. It is only for structured SQL, not for arbitrary data. I can't use it to unpack raw zipped data, for example.

2. It couples the logic for data transformation with view state management. Actually, it makes you do it yourself, so it doesn't really help at all. You'll get burned by storing view state together with your data, e.g. when a batch increment contains no data.

3. It is not built with "incremental materialized views" in mind. According to [0], it still thinks in terms of a batch refill mode and an incremental mode.
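One reading of the failure mode in point 2, sketched in plain Python rather than in any tool's API: if "what has been loaded" is inferred from the view's own rows (for example from the latest day present), an increment that legitimately contains no data leaves no trace and looks like it was never loaded. A separate ledger does not have that problem. All names below are hypothetical.

```python
def last_loaded_from_data(view_rows):
    """State inferred from the data itself: latest day that has any row."""
    return max((row["day"] for row in view_rows), default=None)


def load(view_rows, ledger, day, batch):
    """Append a batch to the view and record the day in a separate ledger."""
    view_rows.extend({"day": day, **row} for row in batch)
    ledger.add(day)  # recorded even when the batch is empty


view_rows, ledger = [], set()
load(view_rows, ledger, "2020-01-01", [{"clicks": 3}])
load(view_rows, ledger, "2020-01-02", [])  # a legitimately empty increment

inferred = last_loaded_from_data(view_rows)  # still "2020-01-01"
recorded = max(ledger)                       # "2020-01-02"
```

With state inferred from the data, the second day would be picked up again on every run; the ledger knows it was already handled.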

It is certainly an improvement over managing SQL scripts by hand, but far from the ultimate goal of maintaining materialized views in a declarative way.

[0] https://docs.getdbt.com/docs/building-a-dbt-project/building...

smknappy 1944 days ago

Have a look at https://www.ascend.io -- it addresses the issues you highlight:

1. SQL + Python + Java + Scala

2. state management fully automated

3. automatic incremental materialized views

sammm 1944 days ago

Out of interest, how do you do testing? Whenever I have seen dbt used, it was usually data analysts creating new tables on the fly in data warehouse scenarios.

Maybe I am just too used to application developer workflows where models are defined in code and then there are ORMs and schema migration tools to help manage all that.

tehlike 1944 days ago

RavenDB gets close.

snidane 1944 days ago

I had in mind OLAP use cases in environments with a lot of both unstructured and structured tabular data. Some kind of scripting is necessary just to structure a bunch of text and JSON into tables.
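The kind of scripting meant here might look like the following sketch, which unpacks gzipped JSON lines and projects each record onto a fixed set of columns. The input format and the function name are assumptions made for the example.

```python
import gzip
import io
import json


def json_lines_to_rows(raw, columns):
    """Decompress gzipped JSON lines and project each record onto fixed columns."""
    rows = []
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines in the raw dump
            record = json.loads(line)
            # Missing fields become None so every row has the same shape.
            rows.append(tuple(record.get(col) for col in columns))
    return rows


# Stand-in for a raw zipped file pulled from some source system.
raw = gzip.compress(b'{"user": "a", "clicks": 3}\n\n{"user": "b"}\n')
table = json_lines_to_rows(raw, columns=("user", "clicks"))
```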

RavenDB, on the other hand, seems to be an OLTP NoSQL database.
