by diggs 3584 days ago
This approach works well enough for relatively small numbers of objects. Once you get into the millions (and significantly higher), it begins to break down. Every "sync" operation has to start from scratch, comparing source and target (possibly through an index) on a file-by-file basis. There are definitely faster ways of doing it that scale to much larger object counts, but they have their own drawbacks.

It's a shame the S3 API doesn't let you order by modified date, or this would be trivial to do efficiently.
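
To illustrate the ordering point, here is a minimal sketch using plain in-memory records rather than any real S3 call. `ObjectInfo`, `changed_since_full_scan` and `changed_since_ordered` are hypothetical names made up for this example. With an unordered listing every entry has to be checked; if a newest-first listing existed (which, per the comment above, S3 does not offer), the scan could stop at the first object older than the last sync.

```python
from dataclasses import dataclass
from itertools import takewhile


@dataclass(frozen=True)
class ObjectInfo:
    key: str
    last_modified: int  # epoch seconds; illustrative values only


def changed_since_full_scan(listing, last_sync):
    """Unordered listing: every entry must be inspected."""
    return [o for o in listing if o.last_modified > last_sync]


def changed_since_ordered(listing_newest_first, last_sync):
    """Hypothetical newest-first listing: stop at the first old entry."""
    return list(
        takewhile(lambda o: o.last_modified > last_sync, listing_newest_first)
    )


objects = [
    ObjectInfo("a", 100),
    ObjectInfo("b", 250),
    ObjectInfo("c", 50),
    ObjectInfo("d", 300),
]
# Simulate the ordering that the listing API would have to provide.
newest_first = sorted(objects, key=lambda o: o.last_modified, reverse=True)
```

This only covers new and modified objects; deletions on the source would still need separate handling.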


I'm curious if you can share how to synchronize N files without doing at least N comparisons.

The main innovations in s3s3mirror are (1) understanding this and going for massive parallelism to speed things up, and (2) where possible, comparing ETag/metadata instead of all bytes.

So far it has scaled pretty well; I know of no faster tool to synchronize buckets with millions of objects.
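
A rough sketch of those two ideas, for readers who want something concrete. This is not s3s3mirror's actual code: `needs_copy` and `plan_sync` are hypothetical names, and the buckets are stand-in dicts mapping a key to an `(etag, size)` pair instead of real S3 listings. It still does N comparisons, but they are cheap metadata comparisons spread across a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def needs_copy(key, source, target):
    """True if the key is missing from target or its (etag, size) differs."""
    return target.get(key) != source[key]


def plan_sync(source, target, workers=8):
    """Return the sorted keys that would have to be copied to target."""
    keys = list(source)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so flags line up with keys.
        flags = list(pool.map(lambda k: needs_copy(k, source, target), keys))
    return sorted(k for k, flag in zip(keys, flags) if flag)


# Stand-in "buckets": key -> (etag, size). Values are made up.
source = {"a.txt": ("etag1", 10), "b.txt": ("etag2", 20), "c.txt": ("etag3", 30)}
target = {"a.txt": ("etag1", 10), "b.txt": ("stale", 20)}
```

In a real tool the per-key work would be network requests rather than dict lookups, which is where the thread pool would actually pay off.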

Sorry, I should have perhaps put a disclaimer in my original comment. I work for a company called StorReduce and built our replication feature* (an intelligent, continuous "sync" effectively). We currently have a patent pending for our method, so I'm not sure if I can offer any real insight unfortunately.

I haven't looked at your project, but based on what you've said I agree the way you're doing it is conceptually as fast as it can be (massively parallel and leveraging metadata) whilst being a general purpose tool that "just works" and has no external dependencies or constraints.

* http://storreduce.com/blog/replication/