by dexen 5618 days ago
This left me wondering: wouldn't the use of a deduplicating storage backend (like http://en.wikipedia.org/wiki/Venti) lessen the incremental cost of servicing each new customer?

Couldn't it reasonably be expected that, encrypted data aside, files with the same content are often used by more than one person?

2 comments

They are already doing some deduplication. From https://spideroak.com/whyspideroak :

> Greatly reduce backup & sync time through comprehensive compression and advanced de-duplication (saving you time)

> You are only charged for the compressed de-duplicated data amount (saving you money)

Still, it is not clear whether they do cross-user deduplication, but I think it is very unlikely, because all the content is encrypted with a user-specific key, which I think they don't have access to.

They don't do cross-user deduplication, yes. They only deduplicate data that belongs to you.
I'd expect most people use these services for backups of business documents, which are almost guaranteed to be unique.

I seem to recall that Dropbox (or another well-known online storage startup) implements this strategy. Maybe it works.

Which will be tiny.

Videos, audio, photos and game media will take the bulk of the space. Of those, only photos and a proportion of the videos are likely to be totally unique to a user.

I doubt that's true for Dropbox, at least.

Consider that Dropbox gives you 50 GB for the basic plan. I'm guessing most people don't back up videos, games or their OS using that space, but rather back up their documents, projects they're working on in whatever field, photos, and music.

Of those, only with music is there a chance to use deduplication, and that's assuming you can figure out two music files with different ID-tags are the same.

(Come to think of it, in my Dropbox, music easily takes up 70% of my quota, so maybe it is worthwhile after all.)

You need a decent hash of every file anyway (to check for changes etc) so it's pretty trivial to deduplicate. I don't think you'd need to do stuff like check the ID-tag.
But then my music files, which I edit the ID-tags for, will hash differently from other people's copies.
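
To make the two points above concrete, here is a minimal sketch of whole-file deduplication keyed on a content hash. It is not how Dropbox or SpiderOak actually work; `DedupStore`, its methods, and the sample data are all hypothetical names made up for illustration, and SHA-256 is just an assumed choice of hash.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one stored copy per unique file content."""

    def __init__(self):
        self.blobs = {}  # SHA-256 hex digest -> file bytes (stored once)
        self.files = {}  # (user, path) -> digest of that file's content

    def put(self, user, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Only keep the bytes if this content has not been seen before.
        self.blobs.setdefault(digest, data)
        self.files[(user, path)] = digest
        return digest

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
audio = b"pretend this is MP3 audio data" * 100

# Two users upload byte-identical files: only one blob is kept.
store.put("alice", "song.mp3", audio)
store.put("bob", "track01.mp3", audio)
print(len(store.files), "files,", len(store.blobs), "stored blob(s)")

# A third user edited the tags, so the whole-file hash no longer matches
# and the entire file is stored again.
store.put("carol", "song.mp3", audio + b"TAG: carol's edited tags")
print(len(store.files), "files,", len(store.blobs), "stored blob(s)")
```

The file name plays no part in the lookup, so renamed copies still deduplicate, but any change to the bytes (such as an edited tag) produces a different hash.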

It would be interesting for Dropbox to release numbers on how many music files are identical between different people.

I believe they could (and probably do?) de-dupe below the file level to handle this issue.
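
A rough sketch of what sub-file deduplication could look like, assuming fixed-size blocks hashed individually. The 4096-byte block size, the function names, and the fake "audio" and "tag" bytes are all assumptions for illustration, not anything a particular service is known to use.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for this example


def block_hashes(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and return each block's SHA-256."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def new_blocks(store, data):
    """Add data's block hashes to store; return how many were not present."""
    added = 0
    for digest in block_hashes(data):
        if digest not in store:
            store.add(digest)
            added += 1
    return added


# 16384 bytes of deterministic filler standing in for the audio stream
# (exactly four blocks, each with different content).
audio = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(512))

# Same audio, different 128-byte tag appended to the end of each file.
alice_file = audio + b"alice's tags".ljust(128, b"\0")
bob_file = audio + b"bob's tags".ljust(128, b"\0")

store = set()
alice_new = new_blocks(store, alice_file)  # every block is new
bob_new = new_blocks(store, bob_file)      # only the tag block is new
print(alice_new, bob_new)
```

This only works out because the tag sits at the end and the audio stays block-aligned. A tag of a different length at the start of the file would shift every block boundary, which is why real systems may prefer content-defined chunking over fixed-size blocks.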