by aaronsteers 959 days ago

Thanks for this feedback! I do agree there are some similarities, which I called out as the common benefits of using "EL pairs" on both sides of the process.

Here are my thoughts though on the importance of the distinction...

The first place you land the data is almost always a place you control - either a data warehouse or a data lake that you have tuned for fast and flexible data processing. The second (publish) process pushes to a location you most likely can't control, and which is not prepared to receive raw/unshaped data.

This is important because the business logic in our transformations will almost always evolve over time. Running the transforms between EL and P (the second "EL") gives us reproducibility and the efficiency to innovate, using the location where we have the best performance profile for running those transforms.

What do you think?

2 comments

tomnipotent 959 days ago

> What do you think?

I'm not convinced the distinction is important enough to warrant anything other than bucketing it under Reverse ETL, and the terms introduced (ELTP and "EL Pairs") I think create less clarity, not more.

> pushes to a location you most likely can't control

Even for internal data hand-offs, this is usually the case. Unless the same team is doing both the ETL work and building the app that uses the output, the data team is delivering something that was signed off by the receiving team.

> not prepared to receive raw/unshaped data

So like all Reverse ETL, which requires some sort of integration boundary for data delivery. That could be an API, or a CSV file uploaded to an FTP server, or reading schema'd JSON from Kafka. In every instance, the data team needs to tailor the output specific to the receiver.
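
A minimal sketch of that tailoring step, assuming a hypothetical warehouse row shape and two hypothetical receivers (one that agreed to a CSV with specific columns, one that expects schema'd JSON messages with its own field names). The names `ROWS`, `to_csv`, and `to_json_messages` are made up for illustration:

```python
import csv
import io
import json

# Hypothetical transformed rows as they might sit in the warehouse.
ROWS = [
    {"customer_id": 1, "email": "a@example.com", "lifetime_value": 120.5},
    {"customer_id": 2, "email": "b@example.com", "lifetime_value": 80.0},
]


def to_csv(rows, columns):
    """Render rows as CSV text, keeping only the columns the receiver agreed to."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_messages(rows, field_map):
    """Render each row as one JSON message, renaming fields to the receiver's schema."""
    return [
        json.dumps({dst: row[src] for src, dst in field_map.items()}, sort_keys=True)
        for row in rows
    ]


if __name__ == "__main__":
    # Same source rows, two receiver-specific shapes.
    print(to_csv(ROWS, ["customer_id", "email"]))
    for message in to_json_messages(ROWS, {"customer_id": "id", "lifetime_value": "ltv"}):
        print(message)
```

The same rows come out in a different shape for each receiver; the delivery mechanism (API call, FTP upload, Kafka producer) is left out here because it depends entirely on the receiving side.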

andrewprock 959 days ago

I do like how the end-to-end pipeline is captured with ELTP. Conceptually I find it lighter than ETL + Reverse ETL. While I might personally find "modular ETL" to be even lighter, that moniker is particular to me and I wouldn't ask anyone else to take it up.

Regarding control, that's something I've never felt with production data. It's such a wild beast. Once the data leaves your team/code, all bets are off.