by andrewprock
950 days ago
It's cleaner to say that ELTP is really just two ETL steps done in sequence. ETL1: gather the raw data from the data source, mapping it to the schema required to load it into the data store. ETL2: pull the normalized data, process it in some way, and load it into a downstream data store. I suppose that ETL is typically bound to getting data into a warehouse, but that feels like a largely arbitrary distinction. We are just moving data from source to sink.
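To make the "two ETL steps in sequence" reading concrete, here is a minimal sketch. The `etl` helper, the field names, and the in-memory lists standing in for the source, the data store, and the downstream store are all hypothetical, chosen only for illustration:

```python
from typing import Callable, Iterable

Row = dict

def etl(extract: Callable[[], Iterable[Row]],
        transform: Callable[[Row], Row],
        load: Callable[[Row], None]) -> None:
    """Move rows from a source to a sink, transforming each one on the way."""
    for row in extract():
        load(transform(row))

# Hypothetical in-memory stand-ins for a source system, a data store,
# and a downstream data store.
source = [{"Name": " Ada ", "Amt": "10"}, {"Name": "Bob", "Amt": "5"}]
store: list[Row] = []
downstream: list[Row] = []

# ETL1: map raw records to the schema the data store expects.
etl(lambda: source,
    lambda r: {"name": r["Name"].strip(), "amount": int(r["Amt"])},
    store.append)

# ETL2: pull the normalized rows, process them, and load them downstream.
etl(lambda: store,
    lambda r: {"name": r["name"], "amount_cents": r["amount"] * 100},
    downstream.append)
```

Both steps go through the same `etl` function; only the source, the transform, and the sink differ.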
Here are my thoughts, though, on why the distinction matters:
The first place you land the data is almost always a place you control - either a data warehouse or a data lake that you have tuned for fast and flexible data processing. The second (publish) process pushes to a location you most likely can't control, and which is not prepared to receive raw/unshaped data.
This is important because the business logic in our transformations will almost always evolve over time. Running the transforms between EL and P (the second "EL") gives us reproducibility and room to iterate efficiently, in the location where we have the best performance profile for running them.
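A rough sketch of that reproducibility argument, with made-up data and hypothetical function names (`transform_v1`, `transform_v2`, `publish`), and a plain dict standing in for the external system:

```python
# Landed once by the EL step, in a store we control; never mutated afterwards.
landed = [
    {"order_id": 1, "amount": 120.0, "country": "US"},
    {"order_id": 2, "amount": 80.0, "country": "DE"},
]

def transform_v1(rows):
    """Original business rule: total revenue across all orders."""
    return {"revenue": sum(r["amount"] for r in rows)}

def transform_v2(rows):
    """Evolved rule: revenue broken out by country."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return {"revenue_by_country": totals}

def publish(target, payload):
    """The P step: push already-shaped data to a system we do not control."""
    target.clear()
    target.update(payload)

external_system = {}  # stand-in for the downstream location
publish(external_system, transform_v1(landed))
publish(external_system, transform_v2(landed))  # logic changed; no re-extraction
```

When the business logic changes, only the T and P steps rerun, against the same landed rows.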
What do you think?