by plasma
1076 days ago
A technique I've used before is to treat Elasticsearch as rebuildable at any time. Consider this approach: a cron job runs every 5 minutes and looks in your database for any objects you're indexing where last_modified_at > last_indexing_started_timestamp. Index those objects into Elasticsearch, then set last_indexing_started_timestamp to the time the sync run started (not when it finished), so that any objects modified between the start and end of the run are caught on the next run. Then if Elasticsearch needs rebuilding, you can just clear the last indexing timestamp and resync from the start of time. It's self-recovering and won't get out of sync.
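A minimal sketch of that loop, assuming a SQLite database and a caller-supplied `index_doc` callback standing in for the real Elasticsearch client call. The table names (`objects`, `sync_state`), the `body` column, and the function names are hypothetical; only `last_modified_at` and `last_indexing_started_timestamp` come from the description above.

```python
import sqlite3
import time


def make_db():
    """Create an in-memory database with the hypothetical schema used below."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE objects (id INTEGER PRIMARY KEY, body TEXT, last_modified_at REAL)"
    )
    # Single-row table holding the sync watermark; NULL means "never synced".
    db.execute("CREATE TABLE sync_state (last_indexing_started_timestamp REAL)")
    db.execute("INSERT INTO sync_state VALUES (NULL)")
    return db


def sync_once(db, index_doc, now=None):
    """Run one sync pass and return how many objects were indexed."""
    # Record the start time BEFORE reading, so anything modified while
    # this run is in progress is picked up by the next run.
    started_at = time.time() if now is None else now

    since = db.execute(
        "SELECT last_indexing_started_timestamp FROM sync_state"
    ).fetchone()[0]

    # A NULL watermark selects everything, i.e. a full rebuild.
    rows = db.execute(
        "SELECT id, body FROM objects WHERE ? IS NULL OR last_modified_at > ?",
        (since, since),
    ).fetchall()

    for obj_id, body in rows:
        index_doc(obj_id, body)  # assumption: raises on failure

    # Only reached if every index call succeeded; otherwise the old
    # watermark stays in place and the next run retries.
    db.execute(
        "UPDATE sync_state SET last_indexing_started_timestamp = ?", (started_at,)
    )
    db.commit()
    return len(rows)
```

To force a full rebuild in this sketch, set the watermark back to NULL (`UPDATE sync_state SET last_indexing_started_timestamp = NULL`) and let the next run reindex everything.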
|
In my case I'm using Solr and my last_indexed field isn't written to until the Solr index call completes without error. I have a very basic lock on the indexing process which hasn't failed me yet, and if it ever did fail the consequences would only be wasted CPU cycles. I consider that a lower risk than updating last_indexed only to have the actual indexing fail unexpectedly.
In the rare instances I've needed to re-index from scratch the process has been incredibly simple:
1. Start a new instance of Solr on a powerful AWS instance and direct index updates to it
2. Set all last_indexed fields to NULL
3. Wait for the scheduled task to complete the re-indexing
4. Restart the new Solr instance on a smaller AWS instance that is sufficient for normal load
5. Shift to the new Solr instance for search engine reads
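For illustration, here is a rough sketch of the per-record variant described in this comment, where last_indexed is only written after the index call succeeds and setting it to NULL triggers a re-index. It assumes SQLite, a caller-supplied `index_doc` callback in place of the real Solr client, and a process-local lock as a stand-in for the "very basic lock". The `docs` table, the `last_modified` column, the selection query, and the function names are all hypothetical.

```python
import sqlite3
import threading

# Stand-in for the "very basic lock"; a real scheduled task would more
# likely use a lock file or a database row.
_index_lock = threading.Lock()


def index_pending(db, index_doc, run_started_at):
    """Index every record that is new or stale; return how many succeeded."""
    if not _index_lock.acquire(blocking=False):
        return 0  # another run is in progress; skip this one
    try:
        # Assumption: a record needs indexing if it was never indexed
        # (NULL) or was modified after it was last indexed.
        rows = db.execute(
            "SELECT id, body FROM docs "
            "WHERE last_indexed IS NULL OR last_indexed < last_modified"
        ).fetchall()
        done = 0
        for doc_id, body in rows:
            try:
                index_doc(doc_id, body)
            except Exception:
                # Leave last_indexed untouched so the next run retries.
                continue
            # Written only after the index call completed without error.
            db.execute(
                "UPDATE docs SET last_indexed = ? WHERE id = ?",
                (run_started_at, doc_id),
            )
            done += 1
        db.commit()
        return done
    finally:
        _index_lock.release()


def reset_for_full_reindex(db):
    """Step 2 of the list above: mark every record as not yet indexed."""
    db.execute("UPDATE docs SET last_indexed = NULL")
    db.commit()
```

If the lock ever failed and two runs overlapped, both would simply index the same records twice, which matches the "wasted CPU cycles" failure mode mentioned above.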