by atombender
786 days ago
The "Merkle tree" algorithm here isn't using a Merkle tree; it's just a binary partitioning algorithm. The point of a Merkle tree is that it's a tree of hashes. It also doesn't really solve the consistency problem the author claims is the biggest problem: yes, over time it will correct for Elasticsearch's eventual consistency, but in the short run it's just as bad as pagination.

I don't know the author's application, but I question the desire to get a consistent dump from Elasticsearch in the first place. It is very much not intended to be a "source of truth", so you're better off streaming the data from your original data source, which is presumably something like an SQL database.

That said, if you want a stable snapshot of an entire index, where your requirement is to never miss documents due to concurrent updates, then you can use Elasticsearch's snapshot support. Each snapshot is just that, a read-only snapshot of the data, allowing consistent reads.

The eventual consistency problem that the article describes is solved by refreshing the index. You can use "refresh=wait_for" when doing an update in order to wait for Elasticsearch to make the update searchable. You can also force a refresh. Any subsequent query will return the newest indexed data.

Since 6.x, Elasticsearch has had docvalue pagination via "search_after", which allows pagination without a durable cursor. Each cursor value is the docvalue set of the last seen document. This is consistent only insofar as the set of source documents is consistent (so it's not safe against concurrent updates). There's essentially little need to use "_scroll" or offset-based pagination anymore.
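
To make the "search_after" point concrete, here is a rough sketch of the paging loop in Python. It is not tied to any particular client: `search` is a hypothetical callable that takes a search request body and returns the response as a dict, and `fake_search` is an in-memory stand-in I made up so the loop can run without a cluster (it only handles ascending sorts). The request and response shapes ("size", "sort", "search_after", and the per-hit "sort" array) follow the search API as I understand it. The sort should end in a unique tiebreaker field; otherwise documents sharing the same sort values at a page boundary can be skipped.

```python
from typing import Any, Callable, Iterator

# Hypothetical signature: request body in, response dict out.
SearchFn = Callable[[dict[str, Any]], dict[str, Any]]


def scan_with_search_after(
    search: SearchFn,
    sort: list[dict[str, str]],
    page_size: int = 1000,
) -> Iterator[dict[str, Any]]:
    """Yield every hit, paging with search_after rather than from/size or _scroll."""
    cursor = None
    while True:
        body: dict[str, Any] = {"size": page_size, "sort": sort}
        if cursor is not None:
            # The cursor is simply the sort values of the last hit we saw.
            body["search_after"] = cursor
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        cursor = hits[-1]["sort"]


def fake_search(docs: list[dict[str, Any]]) -> SearchFn:
    """In-memory stand-in for a search call (illustration only, ascending sorts only)."""

    def search(body: dict[str, Any]) -> dict[str, Any]:
        fields = [next(iter(clause)) for clause in body["sort"]]

        def key(doc: dict[str, Any]) -> list[Any]:
            return [doc[f] for f in fields]

        ordered = sorted(docs, key=key)
        after = body.get("search_after")
        if after is not None:
            # Keep only documents strictly after the cursor in sort order.
            ordered = [d for d in ordered if key(d) > after]
        page = ordered[: body["size"]]
        return {"hits": {"hits": [{"_source": d, "sort": key(d)} for d in page]}}

    return search
```

Usage would look something like `for hit in scan_with_search_after(search, [{"ts": "asc"}, {"id": "asc"}]): ...`. No server-side state is held between pages, which is the whole appeal, but it also means the caveat above applies: if documents are inserted or updated during the scan, the result is not a consistent snapshot.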