| When working with ETL, it really helps to not conflate the letters or worry about them in the wrong order. A lot of the most insane complexity comes out of moving too quickly with data. If you don't have good staging data after running extraction (i.e., a 1:1 view of the source system data available in your database), there is nothing you can do to help with this downstream. You should stop right there and keep digging. Extracting the data should be the most challenging aspect of an ETL pipeline. It can make a lot of sense to write custom software to handle this part. It is worth the investment because if you do the extraction really well, the transform & load stages can happen as a combined afterthought [0,1,2,3] in many situations. This also tends to be one of the fastest ways to deal with gigantic amounts of data. If you are doing things like pulling 2 different tables and joining them in code as part of your T/L stages, you are really missing out on the power of views, CTEs, TVFs, merge statements, etc. [0] https://learn.microsoft.com/en-us/sql/t-sql/statements/merge... [1] https://www.postgresql.org/docs/current/sql-merge.html [2] https://docs.oracle.com/database/121/SQLRF/statements_9017.h... [3] https://www.ibm.com/docs/en/db2/12.1?topic=statements-merge |
Why should this be difficult? It’s the easiest part. You run SELECT * and you’re done.
The difficult part is transforming all the disparate upstream systems and their evolving schemas into a useful analytical model for decision support.