This is awesome. As our userbase grows I've been wondering how to best address this problem.
When we had a migration that required more than 500 batch writes, we first got around the limit by building an array of batches with 500 writes per entry, then committing them all at the end. Then we discovered the bulkWriter API in the Node.js library, which gets around the 500-write batch limit. It looks like `createBatchMigrator` only supports up to 500 writes due to the Firestore batch limit. Is that correct?
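For anyone curious, the "array of batches" workaround described above could look roughly like this. It's only a sketch: `chunk`, `commitInChunks` and the `commitGroup` callback are made-up names for illustration, not part of Firestore or of this library, and the callback stands in for whatever actually commits one group of writes.

```typescript
// The per-batch write limit discussed in this thread.
const MAX_BATCH_SIZE = 500;

// Split a list of pending writes into groups that each fit in one batch.
function chunk<T>(items: readonly T[], size: number = MAX_BATCH_SIZE): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Commit the groups one after another. `commitGroup` is a hypothetical
// callback; in a real migration it would fill and commit one write batch.
async function commitInChunks<T>(
  writes: readonly T[],
  commitGroup: (group: T[]) => Promise<void>,
): Promise<number> {
  let committed = 0;
  for (const group of chunk(writes)) {
    await commitGroup(group);
    committed += group.length;
  }
  return committed;
}
```

Committing group by group, instead of building everything and committing at the end, keeps fewer pending writes in memory. The migration as a whole still isn't atomic either way.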
Thank you! Yes, that's correct. The limit for a batch migrator must be 500 because it uses Firestore write batches internally.
I'm currently writing another migrator that won't use Firestore batches; it'll just use good old Promise.all(). I'm planning to add more capabilities soon, like error-resilient traversers using different traversal strategies, the ability to re-traverse the docs that couldn't be migrated the first time, etc.
It looks like this will support more than 500 writes by calling the batch API repeatedly in a loop (the bulkWriter API is likely doing something similar).
If you really want to address the issue of a growing userbase, I'd highly recommend moving off of Firestore as soon as possible. It's really very inefficient at things like bulk updates (e.g. whereas in a SQL database you could use an UPDATE WHERE, in Firestore this is impossible without first reading every document and then writing it back).
How does this handle concurrent modification of the collection, say if documents are added or deleted? Is there protection against skipping a document or traversing the same document twice?
Great question! Since traversing the entire collection may take a while, it's definitely possible that a new doc has been added in the meantime. Whether that new doc will be traversed or not depends on its order/index within the collection. It definitely won’t be traversed twice.
If it's positioned before all the docs in the current batch then it won't be traversed. If it's positioned after the current batch then it will be. So obviously that also depends on whether you’re traversing a plain collection or a Query.
Catching all the new docs that were added requires implementing a different strategy like adding a temporary field to all the traversed docs and then querying the ones that don’t have that field. It’s definitely something that we can implement soon!
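To make the temporary-field idea concrete, here's a rough in-memory model of it. Everything in it is hypothetical: the `migrated` flag, the `Doc` shape and `migrateUnmarked` are invented for illustration, and a plain array stands in for the collection. How "docs that don't have the field" would map onto an actual Firestore query is left open.

```typescript
// A document with an optional marker that is set once it has been migrated.
interface Doc {
  id: string;
  migrated?: true;
}

// Keep taking batches of unmarked docs until none remain. `afterBatch` lets
// the caller simulate other writers adding docs while the migration runs.
function migrateUnmarked(
  store: Doc[],
  batchSize: number,
  afterBatch?: (store: Doc[]) => void,
): string[] {
  const visited: string[] = [];
  while (true) {
    const batch = store.filter((doc) => !doc.migrated).slice(0, batchSize);
    if (batch.length === 0) break;
    for (const doc of batch) {
      doc.migrated = true; // mark first so the doc never matches again
      visited.push(doc.id);
    }
    if (afterBatch) afterBatch(store);
  }
  return visited;
}
```

Because selection depends on the marker and not on position, a doc that shows up mid-run is still picked up by a later batch, and a marked doc is never selected again.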
It looks like adding and removing documents before the end of the current batch may cause /existing/ documents to be skipped or processed twice.
If you add a new document before the end of the current batch, the offset used for the beginning of the next batch will be too low, causing documents at the boundary to be processed twice. If you delete a document, the offset will be too high, skipping some documents.
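Here's a small in-memory simulation of the offset scenario just described. It assumes the traverser pages by numeric offset over docs ordered by ID, which may not be how the library actually works; `traverseByOffset` and the sorted array of IDs are stand-ins invented for this example.

```typescript
// Page through a sorted list of doc IDs by numeric offset. `afterFirstBatch`
// mutates the collection once, simulating a concurrent add or delete.
function traverseByOffset(
  initial: readonly string[],
  batchSize: number,
  afterFirstBatch: (docs: string[]) => void,
): string[] {
  const docs = [...initial].sort();
  const visited: string[] = [];
  let offset = 0;
  let firstBatch = true;
  while (true) {
    const batch = docs.slice(offset, offset + batchSize);
    if (batch.length === 0) break;
    visited.push(...batch);
    offset += batch.length;
    if (firstBatch) {
      afterFirstBatch(docs);
      docs.sort(); // keep the collection ordered by ID
      firstBatch = false;
    }
  }
  return visited;
}

const ids = ["a", "b", "c", "d", "e", "f"];

// Deleting "a" after the first batch shifts everything left: "c" is skipped.
const afterDelete = traverseByOffset(ids, 2, (docs) => {
  docs.splice(docs.indexOf("a"), 1);
});

// Adding "0" (which sorts before "a") shifts everything right: "b" repeats.
const afterInsert = traverseByOffset(ids, 2, (docs) => {
  docs.push("0");
});

console.log(afterDelete.join(",")); // a,b,d,e,f
console.log(afterInsert.join(",")); // a,b,b,c,d,e,f
```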
I think the temporary field solution might work, but you need stable indexing on the set to be traversed. So I think you need to add the temporary field to new documents and exclude them in the query, and you need to only soft-delete while traversing and exclude the soft-deleted documents post-query. Then you can clean up and remove the temporary fields and soft-deleted documents afterwards.
Are you sure about this? Not sure what you mean by "offset" but we are passing the last document of the current batch to the .startAfter() method which ensures that the next batch only contains the docs that come after that. So there shouldn't be any doc that is processed twice. But as I said earlier, the new docs won't be traversed which is expected. I'm working on a different type of traverser that will fix this.
I'll actually write up some tests to confirm that we don't process any docs twice!
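In the meantime, here's a quick in-memory model of the cursor approach described in this reply. It is not the library's code: `traverseByCursor` is a made-up name, a sorted array of IDs stands in for the collection, and "every ID greater than the cursor" stands in for what `.startAfter()` does on a query ordered by document ID.

```typescript
// Page through a sorted list of doc IDs using the last seen ID as a cursor.
// `afterFirstBatch` mutates the collection once, simulating concurrent writes.
function traverseByCursor(
  initial: readonly string[],
  batchSize: number,
  afterFirstBatch: (docs: string[]) => void,
): string[] {
  const docs = [...initial].sort();
  const visited: string[] = [];
  let cursor: string | undefined;
  let firstBatch = true;
  while (true) {
    const after = cursor;
    // Next batch: the first `batchSize` docs that sort strictly after the cursor.
    const batch = docs
      .filter((id) => after === undefined || id > after)
      .slice(0, batchSize);
    if (batch.length === 0) break;
    visited.push(...batch);
    cursor = batch[batch.length - 1];
    if (firstBatch) {
      afterFirstBatch(docs);
      docs.sort(); // keep the collection ordered by ID
      firstBatch = false;
    }
  }
  return visited;
}

const docIds = ["a", "b", "c", "d", "e", "f"];

// Deleting an already-visited doc does not shift the cursor: nothing is skipped.
const cursorAfterDelete = traverseByCursor(docIds, 2, (docs) => {
  docs.splice(docs.indexOf("a"), 1);
});

// A doc added before the cursor is not traversed; one added after it is.
const cursorAddedBefore = traverseByCursor(docIds, 2, (docs) => {
  docs.push("0");
});
const cursorAddedAfter = traverseByCursor(docIds, 2, (docs) => {
  docs.push("z");
});

console.log(cursorAfterDelete.join(",")); // a,b,c,d,e,f
console.log(cursorAddedBefore.join(",")); // a,b,c,d,e,f
console.log(cursorAddedAfter.join(",")); // a,b,c,d,e,f,z
```

In this model no existing doc is skipped or visited twice, and new docs are traversed only if they sort after the current cursor.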