|
|
|
|
|
by snidane
1939 days ago
|
|
Source data should not experience deletes or updates, otherwise backfills will not work. Deletes can be handled by mirroring source data. Updates are difficult and will need a full CDC system to capture them. Better is to negotiate with data provider to send data updates as appends and never to delete from history. The whole point of ETL is to bring data from one database to another. The comparison of source and destination primary keys can be done in python outside of db. And should be done on entire partitions instead of individual rows. Eg. you only consider which 'day' partitions have been loaded, not which rows have been loaded. |
|
"Negotiating with data provider" is never going to happen - SAP or IBM or whatever vendor of whatever you're integrating is not going to change how their systems work to make your life easier - more likely they would use it as an opportunity to pitch their reporting solution instead.
Meaning from simple data movement, you get need for CDC on source end, then the simple incremental movement, then deduplication on target end - and that one is pretty computationally expensive.
For small data and low refresh frequencies (like singular gigabytes in source size, so hundreds of megabytes in columnar, updated daily), this dance might not be worth it compared to daily full snapshots.
I wish you were right though, my life would be hella easier.