by andrewprock 950 days ago
It's cleaner to say that ELTP is really just two ETL steps done in sequence.

ETL1: gather the raw data from the data source, mapping it to the schema required to load it into the data store.

ETL2: pull the normalized data, process it in some way, and load into a downstream data store.

I suppose that ETL is typically bound to getting data into a warehouse, but that feels like a largely arbitrary distinction. We are just moving data from source to sink.
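
To make the two-step framing concrete, here is a minimal Python sketch of the same source-to-sink move run twice. Everything in it is made up for illustration: the `etl` helper, the in-memory lists standing in for the source, the data store, and the downstream store, and the field names.

```python
from typing import Callable, Iterable


def etl(extract: Callable[[], Iterable[dict]],
        transform: Callable[[dict], dict],
        load: Callable[[dict], None]) -> int:
    """Move records from a source to a sink, transforming each one on the way."""
    count = 0
    for record in extract():
        load(transform(record))
        count += 1
    return count


# Hypothetical in-memory stand-ins for the three locations.
source = [{"Name": " Ada ", "Amount": "10"}, {"Name": "Bob", "Amount": "5"}]
store = []
downstream = []


def to_store_schema(raw: dict) -> dict:
    """ETL1 transform: map raw fields to the schema the data store expects."""
    return {"name": raw["Name"].strip(), "amount": int(raw["Amount"])}


def to_downstream_shape(row: dict) -> dict:
    """ETL2 transform: process the normalized row for the downstream store."""
    return {"customer": row["name"].upper(), "amount_cents": row["amount"] * 100}


# ETL1: raw source -> data store.
etl(lambda: source, to_store_schema, store.append)
# ETL2: data store -> downstream store.
etl(lambda: store, to_downstream_shape, downstream.append)
```

Both steps call the same function; only the source, the transform, and the sink change.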


Thanks for this feedback! I do agree there are some similarities, which I called out as common benefits of using "EL pairs" on both sides of the process.

Here are my thoughts though on the importance of the distinction...

The first place you land the data is almost always a place you control - either a data warehouse or a data lake that you have tuned for fast and flexible data processing. The second (publish) process pushes to a location you most likely can't control, and which is not prepared to receive raw/unshaped data.

This is important because the business logic in our transformations will almost always evolve over time. Running the transforms between EL and P (the second "EL") gives us reproducibility and the efficiency to innovate, because they run in the location where we have the best performance profile for them.
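
A rough Python sketch of what I mean by reproducibility, with made-up data and hypothetical function names (`transform_v1`, `transform_v2`, `publish`): the raw records are landed once, and when the business logic changes we re-run the transform against the same landed data and publish again, without touching the source.

```python
# Raw records landed once by the EL step (hypothetical data).
landed = [
    {"order_id": 1, "amount": 120.0, "country": "US"},
    {"order_id": 2, "amount": 40.0, "country": "DE"},
]


def transform_v1(rows):
    """First version of the business logic: flag orders over 100."""
    return [{"order_id": r["order_id"], "large": r["amount"] > 100} for r in rows]


def transform_v2(rows):
    """Revised logic: the threshold drops to 30 for non-US orders."""
    out = []
    for r in rows:
        threshold = 100 if r["country"] == "US" else 30
        out.append({"order_id": r["order_id"], "large": r["amount"] > threshold})
    return out


def publish(rows, sink):
    """The P step: replace the sink's contents with the shaped rows."""
    sink.clear()
    sink.extend(rows)


sink = []  # stand-in for the downstream location
publish(transform_v1(landed), sink)
first = list(sink)

# The logic evolved: re-run from the same landed data and publish again.
publish(transform_v2(landed), sink)
second = list(sink)
```

The landed data is never modified, so either version of the output can be regenerated at any time.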

What do you think?

> What do you think?

I'm not convinced the distinction is important enough to warrant anything other than bucketing it under Reverse ETL, and the terms introduced (ELTP and "EL Pairs") I think create less clarity, not more.

> pushes to a location you most likely can't control

Even for internal data hand-offs, this is usually the case. Unless the same team is doing both the ETL work and building the app that uses the output, the data team is delivering something that was signed off by the receiving team.

> not prepared to receive raw/unshaped data

So, like all Reverse ETL, which requires some sort of integration boundary for data delivery. That could be an API, or a CSV file uploaded to an FTP server, or reading schema'd JSON from Kafka. In every instance, the data team needs to tailor the output specifically to the receiver.
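
As a small illustration of tailoring output per receiver, here is a Python sketch that shapes the same rows two ways: a CSV payload for a file drop, and JSON messages wrapped in a versioned envelope. The function names, the envelope fields (`schema_version`, `payload`), and the sample rows are assumptions for the example, and the actual upload or Kafka publish is left out.

```python
import csv
import io
import json

# Hypothetical shaped rows, ready for delivery.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]


def to_csv(rows, columns):
    """Render rows as CSV text with the column order the receiver agreed to."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_messages(rows, schema_version):
    """Render each row as one JSON message inside a versioned envelope."""
    return [
        json.dumps({"schema_version": schema_version, "payload": row}, sort_keys=True)
        for row in rows
    ]


csv_payload = to_csv(rows, ["id", "email"])
json_payload = to_json_messages(rows, schema_version=1)
```

The data is the same in both cases; only the contract at the integration boundary differs.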

I do like how the end-to-end pipeline is captured with ELTP. Conceptually I find it lighter than ETL + Reverse ETL. While I might personally find modular ETL to be even lighter, that moniker is particular to myself and I wouldn't ask anyone else to take it up.

Regarding control, that's something I've never felt with production data. It's such a wild beast. Once the data leaves your team/code, all bets are off.