Hacker News new | ask | show | jobs
by BalinKing 782 days ago
Cryptography noob here: I'm confused by "Encrypting without zipping doesn't leak any information about the content." Logically speaking, if we compress first and therefore "the content" will now refer to "the zipped content", doesn't this mean we still can't get any useful information?
2 comments

Not OP, but 'zipping and encrypting' one thing (a file for example) does not leak information by itself. The problem comes when an adversary is able to see the length of your encrypted data, and then can see how that length changes over time - especially if the attacker can control part of the input fed to the compressor.

So if you compressed the string "Bob likes yams" and I could convince you to append a string to it and compress again, then I could see how much the compressed length changed.

If the string I gave you was something already in your data then the string would compress more than it would if the string I gave you was not already in your data - "Bob likes yams and potatoes" will be larger than "Bob likes yams likes Bob".

If the only thing I can see about your data is the length and how it changes under compression - and I can get you to compress that along with data that I hand to you - then eventually I can learn the secret parts of your data.

Encryption generally leaks the size of the plaintext.

This is true in both the compressed and non-compressed case. However with compression the size of the plaintext depends on the contents, so the leak of the size can matter more than when not using compression.

Even without compression this can matter sometimes. Imagine compressing "yes" vs "no".

> Encryption generally leaks the size of the plaintext.

Ah, I see. Naïvely, this seems like a really bad thing for an encryption algorithm to do—is there no way around it? Like, why is encryption different from hashing in this regard?

There are methods, but they are generally very inefficient bandwidth wise in the general case. The general approach is to add extra text (pad) so that all messages are a fixed size (or e.g. some power of 2). The higher the fixed size is, the less information is leaked and the less efficient it is. E.g. if you pad to 64mb but need to transmit a 1mb message, that is 63mb of extra data to transmit.

Part of the problem (afaik) is we lack good math tools to analyze the trade offs of different padding size vs how much extra privacy they provide. This makes it hard to reason about how much padding is "enough".

Another approach is adding a random amount of padding. This can be defeated if you can force the victim to resend messages (which you then average out the size of).

Hashing is different because you don't have to reconstruct the message from the hash. With encryption the recipient needs to decrypt the message eventually and get the original back. However there is no way to transmit (a maximally compressed) message in less space then it takes up.

There are special cases where this doesn't apply e.g. if you have a fixed transmission schedule where you send a sprcific number of bytes on a specific agreed upon schedule.