|
Given that ElasticSearch is "eventually consistent", how do you know when it has caught up? How do you know what data is missing once the count is wrong? It's solveable, of course, but it's a problem that pops up with any synchronization system, and I'm surprised nobody (apparently) has written one, because it requires a fairly good state machine that can also compute diffs. Once a store grows to a certain state, you do not want to trigger full syncs, ever. The best, most trivial (in terms of complexity and resilience) solution I have found is to sync data in batches, give each batch an ID, and record each batch both in the target (eg., ElasticSearch) and in a log that belongs to the synchronization process. The heuristic is then to compute a difference beetween the two logs to see how far you need to catch up. This will only work in a sequential fashion; if ElasticSearch loses random documents, it won't be picked up by such a system. You could fix this by letting each log entry store the list of logical updates, checksummed; and then do regular (eg., nightly) "inventory checks". |
That's nice, in practice, though, ElasticSearch, doesn't behave like an eventually consistent system--it behaves like a flawed fully consistent system. It doesn't self-repair enough to be eventually consistent. If you get out of sync by more than a few seconds, you're going to have to repair the system manually in some fashion. It never "catches up."
Additionally, also in practice, most of the data loss (and real eventual consistency behavior) you'll see in an ElasticSearch+primary-data-store system isn't coming from within ES--it's coming from queues people typically use in the sync process. So there's a degree where you're going to need to handle this on an application-specific basis.
> How do you know what data is missing once the count is wrong?
In practice, people just ignore it or do a full re-index. Theoretically, you should be building merkle trees.
> Once a store grows to a certain state, you do not want to trigger full syncs, ever.
This is not really true. You need to maintain enough capacity for full syncs, because someone will need to do schema changes and/or change linguistic features in the search index.