by luchs 3691 days ago
As I understand it (from reading the README) it's storing SHA(chunk) -> Enc(chunk) in its database. When adding to the backup, it checks whether SHA(chunk') is already there. Thus, it doesn't have to look at the encrypted data. However, it also can't verify that a chunk stores the correct data.
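To make that concrete, here is a minimal sketch of such an index in Python. It is my own illustration, not rdedup's code: `ChunkStore` and `chunk_id` are hypothetical names, and the ciphertext is treated as opaque bytes because the store never decrypts anything.

```python
import hashlib


def chunk_id(chunk: bytes) -> str:
    """Identifier the client derives from the *plaintext* chunk."""
    return hashlib.sha256(chunk).hexdigest()


class ChunkStore:
    """Toy index mapping SHA-256(plaintext chunk) -> encrypted chunk bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def has(self, cid: str) -> bool:
        # Dedup check: only the claimed hash is consulted, never the ciphertext.
        return cid in self._blobs

    def put(self, cid: str, encrypted: bytes) -> bool:
        """Store the blob if the id is new. Returns True if it was stored."""
        if cid in self._blobs:
            return False
        # The store has no key, so it cannot check that `encrypted`
        # really decrypts to a chunk whose hash is `cid`.
        self._blobs[cid] = encrypted
        return True

    def get(self, cid: str) -> bytes:
        return self._blobs[cid]
```

Note that `put` happily accepts any bytes under any id, which is the "can't verify" part.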

So in a multiuser environment, the "first" client to upload an encrypted chunk could lie about its plaintext hash. That would poison the well, and everybody else gets a nasty surprise when their "backup" turns out to be corrupted.
Multi-user deduplication also leaks information via timing attacks (user2 can upload a million key files to see if user1 already stored one).

It's better to only deduplicate on a per-user basis.

If your encryption is deterministic, the second client can check with the server that Hash(Enc(chunk)) is the same on the client and server.
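A rough sketch of that check in Python, assuming the server is willing to report a SHA-256 digest of the blob it already holds. The function names are made up for illustration, and it only helps if the server reports the digest honestly.

```python
import hashlib
import hmac


def ciphertext_digest(ciphertext: bytes) -> str:
    """Digest of the encrypted chunk, computable by both client and server."""
    return hashlib.sha256(ciphertext).hexdigest()


def server_copy_matches(local_ciphertext: bytes, server_digest: str) -> bool:
    """True if the server's stored blob hashes to the same value as ours.

    Only meaningful when encryption is deterministic, so that the same
    chunk always produces the same ciphertext.
    """
    local = ciphertext_digest(local_ciphertext)
    return hmac.compare_digest(local, server_digest)
```

If the digests differ, the second client knows the stored blob is not what it would have uploaded and can refuse to deduplicate against it.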
> If your encryption is deterministic, the second client can check with the server that Hash(Enc(chunk)) is the same on the client and server

but the chunk on the server was encrypted using a different public-key, so how can hash(pub-key-1(chunk)) == hash(pub-key-2(chunk)) ?

Aren't only the decryption keys encrypted to the public keys?
They use the NaCl crypto_box primitive.

This means that you are right. Alas, the decryption key (the symmetric key used to encrypt this particular message) is derived deterministically from the private key and nonce. The nonce they use is the hash of the chunk. Thus, the same chunk will always be encrypted with the same symmetric key.

> Aren't only the decryption keys encrypted to the public keys?

From the README, it appears (to me at least) that chunks are encrypted using public keys. Concretely, the following line:

"Every time rdedup saves a new chunk file, it's data is encrypted using public key so it can only be decrypted using the corresponding secret key. "

Ah, rats, I think you are right.

I was hoping for a client I could run on all the machines I use that would encrypt locally and decrypt centrally.

Now that I think about it I think you really need a 3 tier setup.

Clients have to be trusted, otherwise all keystrokes are recorded, all files could be corrupt, ransomware can strike, etc. So tier1 encrypts locally before sending to the tier2.

Tier2 is less trusted and never sees plain text, but you trust it enough not to spend significant resources attacking your client.

Tier3 is not trusted, may be run by other individuals or organizations, may ship copies of your data to entities that try to break your encryption. Think of an offsite backup service or some distant friend/relative that's willing to let you store offsite backups on their potentially insecure machine.

So tier1 encrypts locally with convergent encryption (symmetric is easy, not sure if there's an asymmetric version) and offers encrypted blobs to tier2.
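For the symmetric case, a toy sketch of convergent encryption in Python using only the standard library. The key is the SHA-256 of the chunk, and the "cipher" is an HMAC-SHA256 keystream XORed over the data. That cipher is a stand-in I picked so the example is self-contained: it is unauthenticated and not meant for real use, where you would want a vetted AEAD instead. All function names are hypothetical.

```python
import hashlib
import hmac


def convergent_key(chunk: bytes) -> bytes:
    """Key derived from the content itself: same chunk, same key."""
    return hashlib.sha256(chunk).digest()


def _keystream(key: bytes, length: int) -> bytes:
    """Illustrative keystream: HMAC-SHA256(key, counter) blocks, truncated."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(chunk: bytes) -> tuple[bytes, bytes]:
    """Return (key, ciphertext). The client keeps the key; tier2 gets the blob."""
    key = convergent_key(chunk)
    stream = _keystream(key, len(chunk))
    return key, bytes(a ^ b for a, b in zip(chunk, stream))


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """XOR with the same keystream recovers the plaintext."""
    stream = _keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))
```

Two clients holding the same chunk produce byte-identical blobs without sharing any secret, which is what lets a tier2 deduplicate without ever seeing plain text.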

Tier2 does a dedup check and either accepts an upload or just tells the client they are subscribed to that encrypted blob. It then applies Reed-Solomon coding to add some redundancy in case one of the offsite backups dies.
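The dedup/subscribe half of that could look roughly like this in Python. This is a sketch under my own assumptions: `Tier2`, `offer` and `upload` are invented names, blobs are identified by the SHA-256 of the ciphertext (which tier2 can recompute itself), and the Reed-Solomon step is left out entirely.

```python
import hashlib


class Tier2:
    """Toy tier2: deduplicates encrypted blobs and tracks who references them."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}          # blob id -> ciphertext
        self._subscribers: dict[str, set[str]] = {}  # blob id -> client ids

    def offer(self, client_id: str, blob_id: str) -> bool:
        """Client offers a blob id. Returns True if an upload is needed.

        If the blob is already stored, the client is subscribed to it
        and no upload is required.
        """
        if blob_id in self._blobs:
            self._subscribers[blob_id].add(client_id)
            return False
        return True

    def upload(self, client_id: str, blob: bytes) -> str:
        """Store the blob under the hash tier2 computes itself."""
        blob_id = hashlib.sha256(blob).hexdigest()
        self._blobs.setdefault(blob_id, blob)
        self._subscribers.setdefault(blob_id, set()).add(client_id)
        return blob_id

    def subscribers(self, blob_id: str) -> frozenset[str]:
        return frozenset(self._subscribers.get(blob_id, ()))
```

Because tier2 derives the id from the bytes it actually received, a client cannot register a blob under a false id. The return value of `offer` still tells a client whether someone else already stored that blob, which is the information leak mentioned upthread.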

Tier3 just receives fixed-size encrypted blobs to provide additional copies of backups in case the tier2 site dies. Maybe even have the tier3s find each other with a DHT. People just decide how many copies they want to keep and what redundancy is acceptable, and the tier3s maintain that.

So trusting sorts would use the tier1, a shared tier2, and keep 3-4 copies in the tier3s. They would enjoy deduplication benefits.

Non-trusting sorts would run a tier1 and tier2 on the same client. More secure, but also no benefit from deduplication across clients.

The really paranoid sorts would run a tier1 and tier2 locally and only trust manually introduced tier3s that they have a high confidence in.

Yes, I think you're right in your conclusions.