Hacker News new | ask | show | jobs
by plasma 1076 days ago
A technique I've used before is to treat Elasticsearch as rebuildable at any time, consider this approach:

A cron runs every 5 minutes that looks at your database for any objects you're indexing where last_modified_at timestamp > last_indexing_started_timestamp.

Index the objects into Elasticsearch, then update the last_indexing_started_timestamp value to be when you started the original sync process, so we catch any modified objects between the start/end of the update run, next run.

Then if Elasticsearch needs rebuilding you can just clear out the last indexing timestamp and resync from the start of time, and its self-recovering / won't get out of sync.

3 comments

I've done exactly this for about a decade and it has worked supremely well. Robust and resilient because it's simple and idempotent.

In my case I'm using Solr and my last_indexed field isn't written to until the Solr index call completes without error. I have a very basic lock on the indexing process which hasn't failed me yet, and if it ever did fail the consequences would only be wasted CPU cycles. I consider that a lower risk than updating last_indexed only to have the actual indexing fail unexpectedly.

In the rare instances I've needed to re-index from scratch the process has been incredibly simple:

1. Start a new instance of Solr on a powerful AWS instance and direct index updates to it

2. Set all last_indexed fields to NULL

3. Wait for the scheduled task to complete the re-indexing

4. Reboot the new Solr instance on a sufficient AWS instance

5. Shift to the new Solr instance for search engine reads

Tried this as well but the difficulty is that Postgres is a relational database whereas ElasticSearch stores schema-less documents.

Your record in ES might include data from many different tables, and figuring out what to (efficiently) update when there is a change in Postgres is not a simple task.

For me, a shotgun approach seemed the least likely to break.

Anything that is a dependency in the elastisearch index should trigger a job to export to it. And since it is idempotent it doesn't matter if it accidentally exports two or ten times the same index in a bg job. Just make sure before writing that you do a quick check that you're not overriding a fresher one. So just have a freshness timestamp which is the latest timestamp of any record used in the indexing data.

Furthermore you can do a daily job to just re export a critical part of the index. Doesn't matter if it is or isn't fresh. So let's say you query all records that were modified in the last day, and trigger the export job thatnmaynincludebthat record. Even if it causes duicate work. Idempotency saves you there.

Perhaps include a "last modified" timestamp w/ timezone in tables of interest, PG can update this on a trigger so no app code has to change. Index this field. Then build a view on top of the relevant tables that assembles the document for ES. Include in the view a field which contains the most recent of all the "last modified" dates, and filter the view on that timestamp?
We've solved this ("any of these N related tables needs to update the search record") but building reactivity into our backend ORM:

https://joist-orm.io/docs/advanced/full-text-search

Granted, currently we still do pgsearch against this derived field, but could sync it over to ES.

This issue comes up in several domains and tech layers (back to front, bottom up etc.)

Are there any generic, algorithmic or even just heuristics that help with this?

It’s something I‘ve been thinking about over some time now. Any pointers, strategies and tips are appreciated.

Or we can also use Logstash