Hacker News new | ask | show | jobs
by aruntdharan 1613 days ago
Thats a great question! We dont use updated timestamps to compute deltas, as thats unreliable and can cause data loss depending on your transaction window.

We keep snapshot tables on your data warehouse(in our own custom schema, so that you dont have to provide Castled write access to any of your production schemas). The snapshot tables are then used as the baseline to compare the query results everytime the pipeline runs. Frankly, we have not really seen a use case of transferring 5 billion rows in a Reverse ETL pipeline. This is mostly because of the fact that our destination apps are mostly transactional systems and cannot really store so much of data. For example, salesforce destination can store max 10GB of data. Because of this, we are storing the actual tuple values in the snapshot. We have easily scaled our pipelines to compute deltas from queries which returns up to 100 million records. To optimize this further, we are also considering to keep the hashes of the tuple values instead of the actual values.

Yes, we need to know the primary key of your query results. This is required to handle failures and to remove the failed records from the snapshot table, so that those can be retried on the next pipeline run.