by spankalee 1829 days ago
It looks like adding and removing documents before the end of the current batch may cause /existing/ documents to be skipped or processed twice.

If you add a new document before the end of the current batch, the offset used for the beginning of the next batch will be too low, causing documents at the boundary to be processed twice. If you delete a document, the offset will be too high, so some documents are skipped.
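To make the concern concrete, here is a minimal in-memory sketch of offset-based batching. It assumes the traverser pages by numeric offset, which is my assumption and not something confirmed about the library; `traverseWithOffset` and the sample keys are hypothetical names, and a sorted array stands in for the collection.

```typescript
// In-memory stand-in for an ordered collection; not a real database API.
// Pages through `docs` by numeric offset and lets the caller mutate the
// collection once, right after the first batch has been processed.
function traverseWithOffset(
  docs: string[],
  batchSize: number,
  mutateAfterFirstBatch: (docs: string[]) => void,
): string[] {
  const seen: string[] = [];
  let offset = 0;
  let first = true;
  while (offset < docs.length) {
    const batch = docs.slice(offset, offset + batchSize);
    seen.push(...batch);
    offset += batch.length;
    if (first) {
      mutateAfterFirstBatch(docs);
      first = false;
    }
  }
  return seen;
}

// Insert "a5" into the already-traversed range: "b" shifts right, so the
// saved offset is now too low and "b" is returned again.
const offsetAfterInsert = traverseWithOffset(["a", "b", "c", "d"], 2, (d) => {
  d.splice(1, 0, "a5");
});

// Delete "a" from the already-traversed range: everything shifts left, so
// the saved offset is now too high and "c" is never returned.
const offsetAfterDelete = traverseWithOffset(["a", "b", "c", "d"], 2, (d) => {
  d.splice(0, 1);
});

console.log(offsetAfterInsert.join(",")); // a,b,b,c,d
console.log(offsetAfterDelete.join(",")); // a,b,d
```

Both failures come from the same cause: the offset counts positions, and positions change when the already-traversed part of the set changes.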

I think the temporary field solution might work, but you need stable indexing on the set being traversed. So I think you need to add the temporary field to new documents and exclude them in the query, and to only soft-delete while traversing, excluding the soft-deleted documents post-query. Then you can clean up afterwards by removing the temporary fields and the soft-deleted documents.
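A rough sketch of what I mean, again in memory and under the same offset assumption. The field names `isNew` and `softDeleted` and the function `stableTraverse` are made up for illustration; a real implementation would express the `isNew` exclusion as a query filter.

```typescript
interface Doc {
  key: string;
  isNew?: boolean;       // hypothetical temporary field set on docs added mid-traversal
  softDeleted?: boolean; // hypothetical soft-delete marker
}

function stableTraverse(
  docs: Doc[],
  batchSize: number,
  mutateAfterFirstBatch: (docs: Doc[]) => void,
): string[] {
  const processed: string[] = [];
  let offset = 0;
  let first = true;
  while (true) {
    // "Query": exclude flagged new docs so positions in the original set stay stable.
    const traversable = docs.filter((d) => !d.isNew);
    if (offset >= traversable.length) break;
    const batch = traversable.slice(offset, offset + batchSize);
    offset += batch.length;
    // Post-query: soft-deleted docs still occupy their position but are not processed.
    for (const d of batch) {
      if (!d.softDeleted) processed.push(d.key);
    }
    if (first) {
      mutateAfterFirstBatch(docs);
      first = false;
    }
  }
  return processed;
}

const stableDocs: Doc[] = ["a", "b", "c", "d"].map((key) => ({ key }));
const stableProcessed = stableTraverse(stableDocs, 2, (d) => {
  d.splice(1, 0, { key: "a5", isNew: true }); // add, flagged with the temporary field
  d[0].softDeleted = true;                    // soft-delete "a" instead of removing it
});

// Cleanup after traversal: drop soft-deleted docs and strip the temporary field.
const stableCleaned = stableDocs
  .filter((d) => !d.softDeleted)
  .map((d) => d.key);

console.log(stableProcessed.join(",")); // a,b,c,d
console.log(stableCleaned.join(","));   // a5,b,c,d
```

The same add and delete that broke plain offset paging no longer disturb the traversal, at the cost of a cleanup pass at the end.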

1 comment

Are you sure about this? I'm not sure what you mean by "offset", but we pass the last document of the current batch to the .startAfter() method, which ensures that the next batch only contains the docs that come after it. So there shouldn't be any doc that is processed twice. But as I said earlier, new docs won't be traversed, which is expected. I'm working on a different type of traverser that will fix this.
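For comparison, here is a small in-memory sketch of the cursor approach. It only imitates the idea behind .startAfter() with a key comparison on a sorted array; `traverseWithCursor` is a hypothetical name and none of this is the library's actual code.

```typescript
// In-memory stand-in for an ordered collection; not a real database API.
// Each batch starts strictly after the last key of the previous batch.
function traverseWithCursor(
  docs: string[],
  batchSize: number,
  mutateAfterFirstBatch: (docs: string[]) => void,
): string[] {
  const seen: string[] = [];
  let cursor: string | null = null;
  let first = true;
  while (true) {
    const after = cursor;
    const batch = docs
      .filter((key) => after === null || key > after)
      .slice(0, batchSize);
    if (batch.length === 0) break;
    seen.push(...batch);
    cursor = batch[batch.length - 1];
    if (first) {
      mutateAfterFirstBatch(docs);
      first = false;
    }
  }
  return seen;
}

// Insert "a5" behind the cursor: it is not traversed, and nothing repeats.
const cursorAfterInsert = traverseWithCursor(["a", "b", "c", "d"], 2, (d) => {
  d.splice(1, 0, "a5");
});

// Delete "a" behind the cursor: nothing is skipped.
const cursorAfterDelete = traverseWithCursor(["a", "b", "c", "d"], 2, (d) => {
  d.splice(0, 1);
});

console.log(cursorAfterInsert.join(",")); // a,b,c,d
console.log(cursorAfterDelete.join(",")); // a,b,c,d
```

Because the cursor is a position in key order and not a count of documents, changes behind it do not shift where the next batch begins.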

I'll actually write up some tests to confirm that we don't process any docs twice!