

by glogla 1947 days ago

That kind of approach is fine for special cases like time series or logs or events, but "no updates or deletes" is never going to be true for arbitrary data.

"Negotiating with data provider" is never going to happen - SAP or IBM or whatever vendor of whatever you're integrating is not going to change how their systems work to make your life easier - more likely they would use it as an opportunity to pitch their reporting solution instead.

So what starts as simple data movement turns into CDC on the source end, then the incremental movement itself, then deduplication on the target end - and that last step is pretty computationally expensive.
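
To make the target-end step concrete, here is a minimal sketch of compacting a stream of change events down to the latest state per key. The `ChangeEvent` shape, the `seq` ordering field and the `"upsert"`/`"delete"` op names are assumptions for illustration, not any particular CDC tool's format.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class ChangeEvent(NamedTuple):
    key: str              # primary key of the source row (hypothetical)
    seq: int              # position in the change log; assumed to increase over time
    op: str               # "upsert" or "delete" (hypothetical op names)
    row: Optional[dict]   # row contents, None for deletes


def compact(events: Iterable[ChangeEvent]) -> Dict[str, dict]:
    """Keep only the newest event per key, then drop keys whose newest event is a delete."""
    latest: Dict[str, ChangeEvent] = {}
    for ev in events:
        current = latest.get(ev.key)
        # A re-delivered event with the same seq does not replace the one already held.
        if current is None or ev.seq > current.seq:
            latest[ev.key] = ev
    return {key: ev.row for key, ev in latest.items() if ev.op != "delete"}


events = [
    ChangeEvent("a", 1, "upsert", {"v": 1}),
    ChangeEvent("b", 2, "upsert", {"v": 10}),
    ChangeEvent("a", 3, "upsert", {"v": 2}),
    ChangeEvent("b", 4, "delete", None),
    ChangeEvent("a", 3, "upsert", {"v": 2}),  # duplicate delivery
]
state = compact(events)
```

In memory this is one pass over the events, but at warehouse scale the same "latest per key" logic means a shuffle or sort over the whole target table, which is where the cost comes from.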

For small data and low refresh frequencies (say single-digit gigabytes in source size, so hundreds of megabytes in columnar format, updated daily), this dance might not be worth it compared to daily full snapshots.

I wish you were right though, my life would be hella easier.

1 comment

snidane 1947 days ago

We are probably referring to different scenarios. When you purchase data for analytics, data providers are usually sophisticated enough to know not to modify their data history. With new ones, the data delivery format can be negotiated.

Data providers usually wait to collect a day or so worth of data before validating and releasing it to customers.

For integrating an OLTP database that updates in real time, on the other hand, yes, you will need CDC.

---

Most of data engineering is just incrementally adding new data to an existing corpus and then running a big batch job to dedup, sort or partition. This last step surely is computationally expensive, but at least it is conceptually simple and can be solved by throwing hardware at it. The first part, the incremental updates, is what imo causes more trouble.
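
A rough sketch of that append-then-batch pattern, assuming append-only records that carry an `id` and a `day` field (both names are made up for the example) and that the only duplicates are re-deliveries of the same record:

```python
from collections import defaultdict
from typing import Dict, List


def append_batch(corpus: List[dict], batch: List[dict]) -> None:
    """Incremental step: blindly append whatever the provider delivered."""
    corpus.extend(batch)


def rebuild(corpus: List[dict]) -> Dict[str, List[dict]]:
    """Batch step: dedup by id, sort by (day, id), partition by day."""
    by_id: Dict[int, dict] = {}
    for rec in corpus:
        by_id[rec["id"]] = rec  # later copy of the same id wins
    partitions: Dict[str, List[dict]] = defaultdict(list)
    for rec in sorted(by_id.values(), key=lambda r: (r["day"], r["id"])):
        partitions[rec["day"]].append(rec)
    return dict(partitions)


corpus: List[dict] = []
append_batch(corpus, [
    {"id": 1, "day": "2020-01-01", "v": "x"},
    {"id": 2, "day": "2020-01-02", "v": "y"},
])
append_batch(corpus, [
    {"id": 2, "day": "2020-01-02", "v": "y"},  # re-delivered record
    {"id": 3, "day": "2020-01-01", "v": "z"},
])
partitions = rebuild(corpus)
```

The `rebuild` step is the part that is expensive but easy to reason about: it can be rerun from scratch at any time. Deciding what goes into each `append_batch` call (late data, overlaps, gaps) is the part that has no equally simple answer.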