by btschaegg 3299 days ago
I haven't read the code (read: speculation ahead!), but at least the "what's already there" part seems rather easy to me if the backups are performed in a chunk-based, deduplicated way (cf. borg backup[1] and restic[2]): First, you perform a GET BUCKET[3], which gives you a list of all objects in the bucket (paginated, at most 1,000 keys per response, so a large bucket takes several requests). If you name your chunk files after their hashes, that listing is all the info you need to tell which chunks are already there. You can then chunk your local files and upload only the missing ones.
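
To make the idea concrete, here is a minimal sketch in Python. It assumes SHA-256 hex digests as object names and fixed-size chunking (real tools such as borg and restic use content-defined chunking instead); `chunk_fixed`, `chunk_key` and `missing_chunks` are hypothetical names, and `existing_keys` stands in for whatever the bucket listing returned.

```python
import hashlib
from typing import Dict, Iterator, Set


def chunk_fixed(data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Split data into fixed-size chunks (an assumption; real tools
    typically use content-defined chunking)."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def chunk_key(chunk: bytes) -> str:
    """Name a chunk after its hash, so equal content gets an equal key."""
    return hashlib.sha256(chunk).hexdigest()


def missing_chunks(data: bytes, existing_keys: Set[str],
                   chunk_size: int = 4) -> Dict[str, bytes]:
    """Return {key: chunk} for every chunk not yet present remotely.

    existing_keys is assumed to be the set of object names obtained
    from the bucket listing.
    """
    missing = {}
    for chunk in chunk_fixed(data, chunk_size):
        key = chunk_key(chunk)
        if key not in existing_keys:
            missing[key] = chunk
    return missing
```

With `data = b"aaaabbbbaaaacccc"` and a listing that already contains the key for `b"aaaa"`, only the `b"bbbb"` and `b"cccc"` chunks come back as missing; the repeated `b"aaaa"` chunk is deduplicated for free.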

The only remaining question is how much listing data (i.e. filenames) you have to download relative to the size of the backup, a ratio you can tune by adjusting the chunk size: larger chunks mean fewer names to list, at the cost of coarser deduplication.
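
A rough back-of-the-envelope version of that trade-off, as a Python sketch. The 1,000 keys per listing response is S3's documented maximum; the 300 bytes per listing entry is purely an assumed figure for illustration, and `listing_cost` is a hypothetical helper. It ignores deduplication, so the chunk count is an upper bound.

```python
def listing_cost(backup_bytes: int, chunk_size: int,
                 bytes_per_entry: int = 300,      # assumed size of one listing entry
                 keys_per_request: int = 1000):   # S3 maximum per listing response
    """Estimate (chunks, list requests, listing bytes) for a full listing."""
    chunks = -(-backup_bytes // chunk_size)        # ceiling division
    requests = -(-chunks // keys_per_request)
    listing_bytes = chunks * bytes_per_entry
    return chunks, requests, listing_bytes


TIB = 1024 ** 4
MIB = 1024 ** 2

print(listing_cost(TIB, 1 * MIB))  # (1048576, 1049, 314572800)
print(listing_cost(TIB, 8 * MIB))  # (131072, 132, 39321600)
```

Under these assumptions, listing 1 TiB of backups costs roughly 300 MiB of metadata at 1 MiB chunks, but only about 37.5 MiB at 8 MiB chunks.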

[1]: https://github.com/borgbackup/borg

[2]: https://restic.github.io/

[3]: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET...

Edit: Because S3's PUT OBJECT[4] is idempotent in this case (i.e. ignoring hash collisions, as their probability should be orders of magnitude lower than a doomsday scenario), you could of course just transfer each chunk every time. Realistically, all that would do is hog your bandwidth and ruin your performance. The idempotency is also what makes it possible to build the whole thing lock-free: without locks, two clients can always end up uploading the same chunk twice, but the second upload simply writes identical content under the same name.
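
The lock-free argument can be illustrated with a small simulation. This is only a sketch: `FakeBucket` is a hypothetical in-memory stand-in for a bucket (no real S3 calls are made), and its `put` overwrites an existing key the way a PUT to an existing object name would.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class FakeBucket:
    """Hypothetical in-memory bucket; put() overwrites existing keys."""

    def __init__(self):
        self.objects = {}

    def put(self, key, body):
        self.objects[key] = body


def upload_all(bucket, chunks):
    """Upload every chunk under its SHA-256 name, without checking first."""
    for chunk in chunks:
        bucket.put(hashlib.sha256(chunk).hexdigest(), chunk)


def run_clients(chunk_lists):
    """Run one uncoordinated 'client' per chunk list against a shared bucket."""
    bucket = FakeBucket()
    with ThreadPoolExecutor(max_workers=len(chunk_lists)) as pool:
        # list() waits for all workers and re-raises any exception
        list(pool.map(lambda chunks: upload_all(bucket, chunks), chunk_lists))
    return bucket.objects
```

Two clients uploading `[b"one", b"two", b"three"]` and `[b"three", b"two", b"four"]` leave exactly four objects in the bucket regardless of interleaving: the overlapping chunks are written twice, but always with the same content under the same name.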

[4]: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT...