

by caladri 5078 days ago

Well, it's possible to make the deduplication work with Glacier, though. Colin's right that you want to phase versions of the whole tar into slower and cheaper storage, but the technical question of whether blocks are still used by newer versions doesn't seem like much of a problem. You can compute it in a batch for extant tars, and track it for new ones. (And you can keep blocks-of-references always in S3 rather than Glacier, say.) The problem he identifies as most troublesome is the case where you would want to deduplicate, but the similar block is in Glacier, so fetching it would bottleneck the whole deduplication process. In that case, you could always treat it as a non-match, right? And optionally have some way to track it, so that when the new block itself gets moved to Glacier you can either determine whether it was actually a match all along, or just live with some duplicate data.
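To make the "treat it as a non-match" idea concrete, here's a rough Python sketch. It is not how tarsnap works; `BlockIndex`, its fields, and the use of SHA-256 as the block name are all made up for illustration. The only assumption carried over from the comment is that the index of block names stays somewhere fast (S3) even when the block data is in Glacier.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class BlockIndex:
    """Hypothetical index of block names, kept in fast storage."""
    in_s3: set = field(default_factory=set)        # names of blocks stored in S3
    in_glacier: set = field(default_factory=set)   # names of blocks stored in Glacier
    possible_duplicates: set = field(default_factory=set)  # to reconcile later

    def store(self, data: bytes) -> str:
        """Decide what to do with a new block, without ever reading Glacier."""
        name = hashlib.sha256(data).hexdigest()
        if name in self.in_s3:
            # A copy is cheaply reachable: ordinary deduplication.
            return "deduplicated"
        # No copy in S3, so store one there.
        self.in_s3.add(name)
        if name in self.in_glacier:
            # The only earlier copy is in Glacier. Treat it as a non-match,
            # but remember it so the duplicate can be sorted out later.
            self.possible_duplicates.add(name)
            return "stored-possible-duplicate"
        return "stored"
```

A block whose only earlier copy is in Glacier gets written again and lands in `possible_duplicates`; whether to reconcile that set later or just accept the extra storage is the trade-off described above.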

Of course, I don't know how tarsnap names or stores its blocks, so I don't know how feasible it is to end up with two blocks under the same name (same hash, but a byte-by-byte mismatch), or whether that's even a problem.

I mean, blocks that have been moved to Glacier because there are no references to them from indices on S3 can be assumed to be less likely to show up in new archives. It's a trade-off, but my experience with deduplication is that it's often not much of a trade-off to get rid of old things even though the magical thinker in me is tempted to think "but what if that chunk happens to show up again somewhere else!?"
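The batch computation of which blocks are no longer referenced by recent archives could look something like this sketch. The function name, the `keep_latest` parameter, and the representation of an archive as a `(name, set_of_block_names)` pair are assumptions for illustration, not anything tarsnap exposes.

```python
def blocks_to_archive(archives, keep_latest):
    """Return block names referenced only by archives older than the
    `keep_latest` most recent ones.

    `archives` is a list of (archive_name, set_of_block_names) pairs,
    ordered oldest first (a hypothetical in-memory form of the indices).
    """
    split = max(len(archives) - keep_latest, 0)
    cold, hot = archives[:split], archives[split:]

    # Blocks still referenced by an archive we want to keep in S3.
    live = set().union(*(blocks for _, blocks in hot))
    # Blocks referenced by any archive being phased out.
    old = set().union(*(blocks for _, blocks in cold))

    # Only blocks with no remaining reference from the hot set can move.
    return old - live
```

For example, with archives `mon = {a, b}`, `tue = {b, c}`, `wed = {c, d}` and `keep_latest=1`, the blocks `a` and `b` are candidates for Glacier, while `c` stays in S3 because `wed` still references it.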