by nikcub 5385 days ago
My biggest issue (besides the initial TC article being a complete shocker) was the claim of a 60% saving from de-duplication and that each user only had 25GB of unique data.

This research paper from Microsoft on Farsite[1] claims 'up to 50%' saving on de-dupe with a convergent file system - but that was tested against 500 computers in a corporate environment and it was done back in 2002.

Users now store a lot more photos, a lot more of their own video, and any content that is DRM'd is also unique. You can save on operating system and application files, but it isn't 60%.

There is nothing 'finally' about this additional information. The discussion and criticism of the claims on Twitter already took into account that this is convergent encryption, with the key derived from the content. There is a lot more that is still unanswered - such as how an 'intelligent cache' allows 'unlimited' storage to be available offline.
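For anyone who hasn't seen convergent encryption before, here is a minimal sketch of the idea: the key is a hash of the content, so two users with the same file produce the same ciphertext and the server can store it once. This is not Bitcasa's implementation (they haven't published one). The names `convergent_key`, `toy_keystream` and `DedupStore` are made up for illustration, and the XOR keystream is a toy stand-in for a real cipher such as AES.

```python
import hashlib
from typing import Dict, Tuple


def convergent_key(plaintext: bytes) -> bytes:
    # The key depends only on the content, so identical files yield identical keys.
    return hashlib.sha256(plaintext).digest()


def toy_keystream(key: bytes, length: int) -> bytes:
    # Toy keystream: SHA-256 over key + counter. NOT secure, illustration only.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(plaintext: bytes) -> Tuple[bytes, bytes]:
    # Returns (key, ciphertext). The client keeps the key; the server sees only ciphertext.
    key = convergent_key(plaintext)
    stream = toy_keystream(key, len(plaintext))
    return key, bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    # XOR with the same keystream reverses the encryption.
    stream = toy_keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


class DedupStore:
    """Hypothetical server side: stores each distinct ciphertext once."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, ciphertext: bytes) -> str:
        blob_id = hashlib.sha256(ciphertext).hexdigest()
        self.blobs.setdefault(blob_id, ciphertext)  # no-op if already stored
        return blob_id


store = DedupStore()
shared_file = b"the same operating system file on two machines"
alice_key, alice_ct = encrypt(shared_file)
bob_key, bob_ct = encrypt(shared_file)
alice_id = store.put(alice_ct)
bob_id = store.put(bob_ct)
unique_id = store.put(encrypt(b"a personal photo nobody else has")[1])
```

Both uploads of `shared_file` map to the same blob, while the unique file takes its own slot. That is why the saving depends entirely on how much of a user's data someone else also has.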

I really wish these guys would release a research paper with their results, or include more information on their website before they make such bold claims in public.

[1] http://research.microsoft.com/apps/pubs/default.aspx?id=6995...

2 comments

> any content that is DRM'd is also unique.

In most cases this isn't true. The computation involved in keying media on the fly while it's being downloaded is not insignificant when considered in volume. The added pain of storing everyone's unique keys also discourages this behavior. At worst you'll see different keys being used by region or datacenter, or perhaps key rotation on a weeks-to-months scale.

Some media (both DRM and non-DRM) will be trivially unique because of metadata like purchaser info or music tags. In some cases this makes the first block unique but all later blocks are deduped; in other cases you need to be somewhat content-aware so you can treat the header data separately from the real media data. This also allows you to catch a lot of data people ripped themselves using standard settings.
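A rough sketch of both cases, assuming fixed-size block hashing (the 4-byte `BLOCK_SIZE`, the header strings and the helper names are all made up for illustration; a real system would use much larger blocks and parse the container format to find the header length):

```python
import hashlib
from typing import List, Tuple

BLOCK_SIZE = 4  # unrealistically small, just to keep the example readable


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    # Hash each fixed-size block; the last block may be shorter.
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def shared_blocks(a: bytes, b: bytes) -> int:
    # Number of distinct blocks that appear in both inputs.
    return len(set(block_hashes(a)) & set(block_hashes(b)))


def split_header(data: bytes, header_len: int) -> Tuple[bytes, bytes]:
    # Content-aware step: header_len is assumed to come from parsing the file format.
    return data[:header_len], data[header_len:]


media = bytes(range(64))  # stand-in for identical media data (16 distinct blocks)

# Case 1: per-user headers of the same, block-aligned length.
# Only the first block differs; every media block dedupes.
same_len = shared_blocks(b"ann;" + media, b"bob;" + media)

# Case 2: per-user headers of different lengths shift the media off the block grid.
file_a = b"alice;" + media
file_b = b"bob;" + media
naive = shared_blocks(file_a, file_b)

# Content-aware: strip each header first, then block-hash only the media payload.
_, payload_a = split_header(file_a, len(b"alice;"))
_, payload_b = split_header(file_b, len(b"bob;"))
aware = shared_blocks(payload_a, payload_b)
```

In case 2 the naive comparison finds nothing in common even though the media bytes are identical, because every block boundary is shifted by two bytes. Separating the header restores the alignment.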

> You can save on operating system and application files, but it isn't 60%.

While I agree that photos and videos will be the bulk of their problem, I don't think that ruins their premise. The question is whether their userbase will be significantly overweight on heavy media creators. If it's a standard distribution, I wouldn't be surprised if a majority of people were under 10GB unique and 70%+ deduped.

I tested it on iTunes. I bought the same television episode on two different accounts on two different laptops and compared them to find that they had 4-5 bytes in common.
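One way to run that kind of comparison, as a sketch only (not necessarily how the test above was done; the two byte strings are made-up stand-ins for the downloaded files):

```python
def matching_positions(a: bytes, b: bytes) -> int:
    # Count offsets where both files hold the same byte value.
    return sum(x == y for x, y in zip(a, b))


def common_prefix_len(a: bytes, b: bytes) -> int:
    # Length of the identical run at the start of both files.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical stand-ins for the two purchased copies.
copy_one = b"HDR1" + bytes([1, 2, 3, 4])
copy_two = b"HDR2" + bytes([9, 8, 7, 6])

same_offsets = matching_positions(copy_one, copy_two)
prefix = common_prefix_len(copy_one, copy_two)
```

For real files you would read them with `open(path, "rb").read()`, or compare in chunks if they don't fit in memory.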

Not sure how it works in WMP. It would be common in Blu-ray rips, though.

I have to say that I am really curious as to why you would want to base your business viability on 'assumptions' regarding the average amount of unique data per user.

I think one point that gets missed in this generalization is the target audience: the early adopters of an unlimited service.

As we at SpiderOak quickly found out, there are massive differences between 'common users' and the early adopters of a technically proficient service.

I am worried that Bitcasa will be in for a pretty rude awakening in that a large percentage of their early paying users will be JUST the people that store a TON of unique data (RAW files, encrypted data, etc.), skyrocketing their storage costs in a vulnerable phase of their development.