|
Consider a scheme in which: Each user generates a symmetric "user key", kU. The plaintext of each file (or without loss of generality, block of data, etc.), pFile, is encrypted with a randomly generated symmetric key, kFile, producing the ciphertext cFile. pFile is also hashed with a cryptographically strong hash, producing hpFile. kFile is then encrypted with hFile, producing ckFile. The user encrypts pFile with kU, producing chpFile. Finally, the user takes the first N bits of hpFile (for N on the order of, say, 16 or 32), producing hpFileTrunc. The user then submits hpFileTrunc to the server. The server is, semantically, just a list of 3-tuples: (cFile, ckFile, hpFileTrunc). The server sees if it knows of the existence of records with the same hpFileTrunc value as the client's submission. If so, it returns them to the client. The client then tries, for each record returned by the server, decrypting ckFile2 with the client's hFile value, potentially producing kFile. If this is successful, the client then decrypts cFile with kFile, producing pFile. Finally, it compares this pFile to the original. If it matches, a match has been found, and the client exits the loop. If not, (or if either of the two decryption steps failed), it continues to the next record the server returned. If there are no more records, the client instead submits the tuple (cFile, ckFile, hpFileTrunc) to the server, which stores it. Finally (whether or not a match was found), the client stores chpFile locally, to be used when retrieving the file. To retrieve the file, the user decrypts chpFile with kU, producing hpFile. They truncate hpFile, producing hpFileTrunc, and submit it to the server. They perform the same process described earlier to retrieve the matching pFile. (Note: truncation may also be replaced by, or combined with, a second round of hashing.) With this scheme, assuming secure primitives (authenticated encryption and hashing), I don't believe it's possible to learn any information about a file unless you already have its contents. So the server can tell if you're accessing (storing or retrieving) a particular file if and only if the server knows what it's looking for. TL;DR: you can totally construct a scheme that allows meaningful comparison of plaintexts! But... this is probably a bad thing. Comparison of plaintexts is a vulnerability: the server being able to see who's storing a particular "bad" file has a real impact on privacy. And likely more subtle impacts, too... |
> The server sees if it knows of the existence of records with the same hpFileTrunc value as the client's submission. If so, it returns them to the client.
And by doing this, provides a way for clients to verify if any user on the file storage server has this file. So if I wanted to know if your mozilla thunderbird has a mail I have the source to, I simply try to store this and get these duplicate records.
Most people would consider this extremely unacceptable.
> The client then tries, for each record returned by the server, decrypting ckFile2 with the client's hFile value, potentially producing kFile. If this is successful, the client then decrypts cFile with kFile, producing pFile. Finally, it compares this pFile to the original. If it matches, a match has been found, and the client exits the loop. If not, (or if either of the two decryption steps failed), it continues to the next record the server returned. If there are no more records, the client instead submits the tuple (cFile, ckFile, hpFileTrunc) to the server, which stores it.
Why would the client have the keys to files stored by other users ?
Unless you mean that you can only deduplicate within a single client, in which case that's of much more limited use (and I might add, your encryption scheme is way more complex than it needs to be).