Something is twitching in the back of my mind about this. Sure, they can't look at the data based solely on the encrypted copy, but if they have a plaintext copy of a document of interest, they are able to determine which of their customers has that document, right?
Sure, but known-plaintext attacks are not the worst part. Consider this [found via http://www.mail-archive.com/cryptography@metzdowd.com/msg089...]: I take the standard Wordpress config.php [for your host], fill in your site and account name, fill in the one million most common database passwords, and ask the cloud provider whether any of these hashes exist.
Or: I create a form (say .doc) with a single field, CC#, and hope people store this. I then check the existence of 10^11 hashes to find (all customers'!) credit card numbers (for a specific issuer). This takes only a CPU-day! (The network is obviously slower.)
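To make the form attack concrete, here is a minimal sketch. It assumes a simplified scheme where the provider can be asked whether a hash of the plaintext exists. The scheme discussed in this thread hashes the ciphertext instead, but anyone holding the plaintext can compute that too. The template, the card numbers and the function names are all made up for illustration.

```python
import hashlib

# Hypothetical template: every byte is predictable except one field.
TEMPLATE = "Payment form\nCC#: {cc}\n"

def convergent_id(plaintext: bytes) -> str:
    """Identifier a naive dedup scheme exposes (assumption: hash of plaintext)."""
    return hashlib.sha256(plaintext).hexdigest()

def find_secrets(stored_ids, candidates):
    """Return every candidate whose filled-in template is already stored."""
    hits = []
    for cand in candidates:
        doc = TEMPLATE.format(cc=cand).encode()
        if convergent_id(doc) in stored_ids:
            hits.append(cand)
    return hits

# Toy demo: one "customer" stored the form; the attacker enumerates 10^4
# candidates instead of 10^11.
victim_cc = "4111110000001234"
stored = {convergent_id(TEMPLATE.format(cc=victim_cc).encode())}
candidates = (f"411111000000{n:04d}" for n in range(10_000))
found = find_secrets(stored, candidates)
print(found)
```

The attacker never decrypts anything; the existence check alone leaks the field.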
Eh. OK let's say instead of just SHA-256'ing the plaintext data to derive a key you do 50,000 bcrypt rounds. Then the client encrypts the plaintext, hashes the ciphertext, and sends the hash to the server. If it takes 0.5 s to generate a single bcrypt key, it would take about 1,500 years to find a single credit card number.
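A quick check of that estimate, assuming the 10^11 candidate space from the comment above, one key derivation per candidate, and a single core:

```python
# Back-of-the-envelope cost of enumerating the whole candidate space
# when each key derivation is deliberately slow.
candidates = 10**11          # candidate card numbers (figure from the thread)
seconds_per_key = 0.5        # assumed cost of one stretched key derivation
seconds_per_year = 365.25 * 24 * 3600

years = candidates * seconds_per_key / seconds_per_year
print(f"{years:.0f} years on one core")
```

That is the cost of walking the full space on one core; the work parallelizes, so an attacker with N cores divides it by N.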
This does introduce new avenues for attacks, however. You don't have to be able to decrypt to show that certain people have certain files.
Also, for files that contain just one piece of sensitive information while the rest is predictable (e.g., the secret key file for a website back-end), you've effectively given up a hash of the secret, which can then be brute-forced.
EDIT2: Actually, there's more to this problem than just convergent encryption. If the storage provider knows which encrypted blobs belong to you, it can encrypt _some_ file and still figure out which users have copies of it. So, the storage provider, which stores a collection of encrypted blobs, should not know the blob -> list(users) association. I don't know if Bitcasa addresses this part.
My biggest issue (besides the initial TC article being a complete shocker) was the claim of a 60% saving from de-duplication and that each user only had 25 GB of unique data.
This research paper from Microsoft on Farsite[2] claims 'up to 50%' saving on de-dupe with a convergent file system - but that was tested against 500 computers in a corporate environment and it was done back in 2002.
Users now store a lot more photos, a lot more of their own video, and any content that is DRM'd is also unique. You can save on operating system and application files, but it isn't 60%.
There is nothing 'finally' about this additional information. The discussion and criticism of the claims on Twitter already took into account convergent encryption and the key being derived from the content. There is a lot more that is still unanswered, such as how an 'intelligent cache' allows 'unlimited' storage to be available offline.
I really wish these guys would release a research paper with their results, or include more information on their website before they make such bold claims in public.
In most cases this isn't true. The computation involved in keying media on the fly while it's being downloaded is not insignificant when considered in volume. The added pain of storing everyone's unique keys also discourages this behavior. At worst you'll see different keys being used by region or datacenter, or perhaps key rotation on a weeks-months scale.
Some media (both DRM and non-DRM) will be trivially unique because of metadata like purchaser info or music tags. In some cases this makes the first block unique but all later blocks are deduped, in other cases you need to be somewhat content aware so you can treat the header data separate from the real media data. This also allows you to catch a lot of data people ripped themselves using standard settings.
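A minimal sketch of the "first block unique, later blocks deduped" case, assuming fixed-size 4096-byte blocks (the block size and all names here are illustrative, not anything Bitcasa has published):

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size

def block_ids(data: bytes, block_size: int = BLOCK_SIZE):
    """Hash each fixed-size block of the file separately."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

# Two copies of the same media: identical body, different per-buyer header.
body = b"".join(bytes([i]) * BLOCK_SIZE for i in range(3))
file_a = b"A" * BLOCK_SIZE + body   # header carries purchaser A's metadata
file_b = b"B" * BLOCK_SIZE + body   # header carries purchaser B's metadata

known = set(block_ids(file_a))                       # already stored
missing = [b for b in block_ids(file_b) if b not in known]
print(len(missing), "of", len(block_ids(file_b)), "blocks need storing")
```

This only works here because both headers are exactly one block long. If the metadata length varies, everything after it shifts and fixed-size blocks stop matching, which is where the content-aware handling comes in.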
> You can save on operating system and application files, but it isn't 60%.
While I agree that photos and videos will be the bulk of their problem, I don't think that ruins their premise. The question is whether their userbase will be significantly overweight on heavy media creators. If it's a standard distribution, I wouldn't be surprised if a majority of people were under 10 GB unique and 70%+ deduped.
I tested it on iTunes. I bought the same television episode on two different accounts on two different laptops and compared them to find that they had 4-5 bytes in common.
Not sure how it works in WMP. It would be common in blu-ray rips, though.
I have to say that I am really curious as to why you would want to base your business viability on 'assumptions' regarding the average amount of users' unique data.
I think one missed point when doing this generalization is also the target audience in early adopters for an unlimited service.
As we at SpiderOak quickly found out, there are massive differences between 'common users' and the early adopters of a technically proficient service.
I am worried that Bitcasa will be in for a pretty rude awakening in that a large percentage of their early paying users will be JUST the people that store a TON of unique data (Raw files, encrypted data etc), skyrocketing their storage costs in a vulnerable phase of their development.
it's important to note that this is not strong against knowledge of the plaintext. that's kind-of obvious, when you think about how it supports de-duplication, but perhaps an example will clarify why you might be concerned.
say you want to backup some data. and that data includes music or video... and the riaa or mpaa decide that bitcasa are facilitating pirating and should be shut down... so they reach a deal where all the data are checked against known songs or videos. and if they find a match then your identity will be provided for prosecution...
of course, if you are doing nothing wrong, you have nothing to fear. this can only identify known data. but even so, it is an interesting issue: "encryption" here doesn't have all the guarantees you might expect.
(there are more disturbing scenarios too. for example, perhaps a certain text is not illegal in the copyright sense, but is unacceptable politically.)
[disclaimer - this is from skimming the paper; i should say that i am no expert on this, so don't take my word as gospel]
"HP: What do you do in terms of encryption or security?
TG: We encrypt everything on the client side. We use AES-256 hash, SHA-256 hashing for all the data.
HP: So it’s encrypted all on the client side and you can’t look at it on the server side?
TG: Exactly"
Finally, a company that gets it. I've been asking for this for a while now. I wish Dropbox and all the others would do this, too. I get it that some of Dropbox' customers may not want to deal with the encryption on the client side, but they should at least offer the option to everyone, and it should be right there every time someone wants to upload something. It would be best if it was the default option, too.
This way they won't get into the mess they got into last time with the feds asking for user data, and the clients who want full security of their data won't have to be worried about it anymore.
Disclaimer: no affiliation, just like the product. I use SpiderOak to backup private things like AWS keys, KeePass data file, Bitcoin wallet, etc., and Dropbox for documents, photos, and everything else not quite as sensitive.
In addition to Wuala, SpiderOak does this as well.
A problem remains "with full security" in that you have no idea what's going on in the binary client program. Reveal or open-source the client program and allow customers who need this end-to-end security to compile the program themselves.
We at SpiderOak in fact do not cross-account deduplicate AT ALL and provide a full zero-knowledge environment with no access to client side encryption key info.
We feel that the possible cost savings involved with deduplicating data across user accounts is just not worth the inherent security risks.
TL;DR version: take a chunk of data, encrypt it with its own sha1 hash as the key. Now you have an encrypted version that you can dedup. You can only decrypt if you already know the hash. Info about who owns any particular chunk is not kept on the server, so even if you break in to the server, all you can tell is which chunks correspond to data you already possess. Seems plausible.
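A rough sketch of that TL;DR. Two assumptions to flag loudly: it uses SHA-256 rather than SHA-1, and the "cipher" is a toy hash-based XOR keystream standing in for a real one such as AES, because the Python standard library has no AES. Don't use it for anything real.

```python
import hashlib

def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Stand-in for a real cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def convergent_encrypt(chunk: bytes):
    """Key is derived from the content, so equal chunks give equal blobs."""
    key = hashlib.sha256(chunk).digest()
    ciphertext = bytes(a ^ b for a, b in zip(chunk, _keystream(key, len(chunk))))
    blob_id = hashlib.sha256(ciphertext).hexdigest()  # what the server dedups on
    return key, blob_id, ciphertext

def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Only someone who knows the content hash (the key) can decrypt."""
    stream = _keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))

key1, id1, ct1 = convergent_encrypt(b"the same chunk")
key2, id2, ct2 = convergent_encrypt(b"the same chunk")
print(id1 == id2)                      # two users, one stored blob
print(convergent_decrypt(key1, ct2))   # either user's key opens it
```

The determinism that makes dedup possible is the same property that lets anyone holding the plaintext recompute the blob ID.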
The list of "who owns which hashes" must be stored on their servers, even if it's not the "same" server. Otherwise I would have to manually transfer my hashes from one computer to another.
Well, OK, but that data can also be convergently encrypted, so you only have to transfer the hash, not the whole list. But your point is well taken. If you can get your data from a different machine with nothing but a user name and password, that's probably a security hole.
> How would I know the hash? I'd have to save the individual key(hash) for every file I upload?
Yes, but that's no different from keeping any other kind of directory structure. And you can apply the same trick to the directory structure itself, so all you really need to keep is a "root hash" to your (encrypted) directory.
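Here is one way the "root hash" idea could look. It is a sketch under assumptions, not how Bitcasa does it: the blob locator is derived from the key (SHA-256 of the key) so that a key alone is enough to fetch a blob, the cipher is a toy XOR keystream standing in for a real one, and the directory is just a JSON listing. All names are hypothetical.

```python
import hashlib
import json

def _xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy cipher (SHA-256 counter-mode keystream); encrypts and decrypts."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

class Store:
    def __init__(self):
        self.blobs = {}  # locator -> ciphertext; this is all the server holds

    def put(self, plaintext: bytes) -> str:
        key = hashlib.sha256(plaintext).digest()      # convergent key
        locator = hashlib.sha256(key).hexdigest()     # assumed locator scheme
        self.blobs[locator] = _xor_stream(key, plaintext)
        return key.hex()

    def get(self, key_hex: str) -> bytes:
        key = bytes.fromhex(key_hex)
        locator = hashlib.sha256(key).hexdigest()
        return _xor_stream(key, self.blobs[locator])

def put_directory(store: Store, files: dict) -> str:
    """Store each file, then store the name -> key listing as a blob too."""
    listing = {name: store.put(data) for name, data in files.items()}
    return store.put(json.dumps(listing, sort_keys=True).encode())

def read_file(store: Store, root: str, name: str) -> bytes:
    listing = json.loads(store.get(root))
    return store.get(listing[name])

store = Store()
root = put_directory(store, {"a.txt": b"hello", "b.txt": b"world"})
print(read_file(store, root, "a.txt"))  # the root key is all the client keeps
```

The client keeps only `root`; the per-file keys live inside the encrypted listing.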
> Also, if I wanted to know if you had a specific file (and I had access to all your encrypted files) this would be trivial, correct?
What do you mean by "I had access to all your encrypted files"? If you've broken in to the server, everything is encrypted, including the directories. The only thing you can tell is whether a particular encrypted block corresponds to data that you already possess (or possessed at some time in the past). But that by itself tells you nothing.
While technically correct, that's not a practical observation. A memory bank storing your “large enough repository of user files” would consume the entire universe.
That said, people don't store random bitstrings. People store music on these shared storages. If I were a big media company I could find all the MP3s of songs I own floating around P2P networks, compute their encrypted forms and subpoena the storage company for user accounts storing any one of the files. People have also been known to synchronize application data, including files with secret keys or passwords, which in this case effectively shares a hash of the password. That's better than Dropbox, but still, if the key + normal file variation doesn't have enough entropy, an attacker could brute-force the contents of the file.
EDIT: Those are just potential real-world attacks I can think of on the spot; I'm sure there are plenty of others. While this is certainly (marginally) better than Dropbox, real security and data de-duplication are mutually exclusive.
Dedup also helps with big media files that are identical across many users.
I'm pretty sure Bitcasa has said they avoid uploading known duplicates at all, which suggests the hash (either pre- or post- encryption) is shipped up first.
Cross-account deduplication has nothing to do with making improvements client-side; it's all about keeping storage costs down for the provider.
The only type of deduplication that matters to consumers is account-specific deduplication, and that saves no data for the provider unless you charge for non-deduplicated storage.
Cross-account deduplication does have a client-side benefit: duplicate files don't need to be uploaded. E.g. your 300 GB iTunes library might sync to the server in a couple of minutes, rather than days.
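A sketch of the "ship the hash first" idea, with a made-up in-memory server. The class and method names are hypothetical, and the blobs are assumed to be already-encrypted bytes:

```python
import hashlib

class DedupServer:
    """Hypothetical server that stores each distinct blob once."""
    def __init__(self):
        self.blobs = {}

    def has(self, blob_id: str) -> bool:
        return blob_id in self.blobs

    def upload(self, blob_id: str, blob: bytes) -> None:
        self.blobs[blob_id] = blob

def sync(server: DedupServer, blobs) -> int:
    """Send the hash first; upload the body only if the server lacks it.

    Returns the number of body bytes actually sent.
    """
    sent = 0
    for blob in blobs:
        blob_id = hashlib.sha256(blob).hexdigest()
        if not server.has(blob_id):
            server.upload(blob_id, blob)
            sent += len(blob)
    return sent

server = DedupServer()
song = b"x" * 1000                                 # stands in for a popular file
first_user = sync(server, [song])                  # uploads all 1000 bytes
second_user = sync(server, [song, b"unique"])      # only the 6 new bytes
print(first_user, second_user)
```

Note that `has` is exactly the "does this hash exist" query that the attacks earlier in the thread rely on.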
Per-block dedupe, I bet, so it'll get benefits at sub-file levels.
Though given that the vast majority (by volume) of customer data will be people's illegal downloads, their scheme effectively reduces to being a copy of usenet.
the biggest win by far is media files and other large assets like game files or large data sets that many, many people will have duplicate copies of. Even with a pretty low installed base 98%-100% of people's iTunes storage will be deduped. OS files are a big win too, really very very little (by block volume) on most people's drives is unique.
basically the argument is that this is a deterministic encryption algorithm: there is no randomness beyond the initial value. This sounds more like a random oracle, http://en.wikipedia.org/wiki/Random_oracle, and those, by the way, don't exist.
So I can encrypt a file, upload it, and if someone else encrypts the exact same file... they can decrypt my uploaded file? I'm having a hard time wrapping my head around this.
I believe that the same file will encrypt to the same ciphertext every time. So, if someone else uploads the same file, the service just points their entry at the encrypted copy that is already stored.
Doesn't that diminish some of the privacy claims?