Hacker News new | ask | show | jobs
by snidane 1939 days ago
Source data should not experience deletes or updates, otherwise backfills will not work. Deletes can be handled by mirroring source data. Updates are difficult and will need a full CDC system to capture them. Better is to negotiate with data provider to send data updates as appends and never to delete from history.

The whole point of ETL is to bring data from one database to another. The comparison of source and destination primary keys can be done in python outside of db. And should be done on entire partitions instead of individual rows. Eg. you only consider which 'day' partitions have been loaded, not which rows have been loaded.

1 comments

That kind of approach is fine for special cases like time series or logs or events, but "no updates or deletes" is never going to be true for arbitrary data.

"Negotiating with data provider" is never going to happen - SAP or IBM or whatever vendor of whatever you're integrating is not going to change how their systems work to make your life easier - more likely they would use it as an opportunity to pitch their reporting solution instead.

Meaning from simple data movement, you get need for CDC on source end, then the simple incremental movement, then deduplication on target end - and that one is pretty computationally expensive.

For small data and low refresh frequencies (like singular gigabytes in source size, so hundreds of megabytes in columnar, updated daily), this dance might not be worth it compared to daily full snapshots.

I wish you were right though, my life would be hella easier.

We are probably refering to different scenarios. When purchasing data for analytics, data providers are usually sophisticated enough to know not to modify their data history. With new ones, data delivery format can be negotiated.

Data providers usually wait for a day or something worth of data to collect before validating and releasing it to customers.

For integrating some OLTP database updating in real time on the other hand, yes you will need CDC.

---

Most of data engineering is just incrementally adding new data to existing corpus and then running a big batch job to dedup, sort or partition. This last step surely is computationally expensive, but at least it is conceptually simple and can be solved by throwing hardware at it. The first part of incremental updates is what imo causes more troubles.