Hacker News new | ask | show | jobs
by tnmom 781 days ago
Huh, never heard that before. Does it leak more information than just encrypting without zipping? Struggling to imagine how this attack works.
4 comments

It's an extension of the chosen-plaintext attack, and so requires the attacker to be able to send custom text that they know is in the encrypted payload. If the unencrypted payload is "our-secret-data :::: some user specified text", then the attacker can eventually determine the contents of our-secret-data by observing how the size of the encrypted response changes as they change the text when the compression step matches up with a part of the secret data. It can be defeated by adding random-length padding after compression and before the encryption step, though.
Essentially if you zip something, repeated text will be deduplicated.

For example "FooFoo" will be smaller than "FooBar" since there is a repeated pattern in the first one.

The attacker can look at the file size and make guesses about how repetitive the text is if they know what the uncompressed or normal size is.

This gets more powerful if the attacker can insert some of their own plaintext.

For example if the plaintext is "Foo" and the attacker inserts "Fo" (giving "FooFo") the result will be smaller than if they inserted zq where there is no pattern. By making lots of guesses the attacker can figure out the secret part of the text a little bit at a time just by observing the size of the ciphertext after inserting different guesses.

Encrypting without zipping doesn't leak any information about the content. You can't rule out certain byte sequences (other than by total length) just by looking at the ciphertext length.

If "oui" compresses to two bytes and "non" compresses to one byte, and then you go over them with a stream cipher, which is which:

A: ;

B: *&

This has nothing to do with compression. If you use "yes" and "no" instead of "oui" and "non" (which just happen to be three characters each) and you compress "yes" to "T" and "no" to "F" then the uncompressed text will be the leaky one.
It’s an example meant to prove the idea.
Yes, and my example was an example meant to prove the opposite idea. The point is that it is irrelevant whether you compress or not. You can leak information either way.
I leak the length of my phone call and you leak:

1. the length of your phone call; and

2. what language you were speaking; oh and

3. half the words you said

(i.e. pwned)

https://web.archive.org/web/20080901185111/https://technolog...

> you leak [a bunch of stuff]

How? Remember, the uncompressed text gets encrypted too.

Cryptography noob here: I'm confused by "Encrypting without zipping doesn't leak any information about the content." Logically speaking, if we compress first and therefore "the content" will now refer to "the zipped content", doesn't this mean we still can't get any useful information?
Not OP, but 'zipping and encrypting' one thing (a file for example) does not leak information by itself. The problem comes when an adversary is able to see the length of your encrypted data, and then can see how that length changes over time - especially if the attacker can control part of the input fed to the compressor.

So if you compressed the string "Bob likes yams" and I could convince you to append a string to it and compress again, then I could see how much the compressed length changed.

If the string I gave you was something already in your data then the string would compress more than it would if the string I gave you was not already in your data - "Bob likes yams and potatoes" will be larger than "Bob likes yams likes Bob".

If the only thing I can see about your data is the length and how it changes under compression - and I can get you to compress that along with data that I hand to you - then eventually I can learn the secret parts of your data.

Encryption generally leaks the size of the plaintext.

This is true in both the compressed and non-compressed case. However with compression the size of the plaintext depends on the contents, so the leak of the size can matter more than when not using compression.

Even without compression this can matter sometimes. Imagine compressing "yes" vs "no".

> Encryption generally leaks the size of the plaintext.

Ah, I see. Naïvely, this seems like a really bad thing for an encryption algorithm to do—is there no way around it? Like, why is encryption different from hashing in this regard?

There are methods, but they are generally very inefficient bandwidth wise in the general case. The general approach is to add extra text (pad) so that all messages are a fixed size (or e.g. some power of 2). The higher the fixed size is, the less information is leaked and the less efficient it is. E.g. if you pad to 64mb but need to transmit a 1mb message, that is 63mb of extra data to transmit.

Part of the problem (afaik) is we lack good math tools to analyze the trade offs of different padding size vs how much extra privacy they provide. This makes it hard to reason about how much padding is "enough".

Another approach is adding a random amount of padding. This can be defeated if you can force the victim to resend messages (which you then average out the size of).

Hashing is different because you don't have to reconstruct the message from the hash. With encryption the recipient needs to decrypt the message eventually and get the original back. However there is no way to transmit (a maximally compressed) message in less space then it takes up.

There are special cases where this doesn't apply e.g. if you have a fixed transmission schedule where you send a sprcific number of bytes on a specific agreed upon schedule.

Yes, of course it leaks more information than encryption without compression, because that’s just encryption which doesn’t leak anything.

In an enormous number of real world cases adversaries can end up including attacker-controller input alongside secret data. In that case you can guess at secret data and if you guess correctly, you get smaller compressed output. But even without that, imagine the worst case: a 1TB file that compresses to a handful of bytes. Pretty clearly the overwhelming majority of the text is just duplicate bytes. That’s information which is leaked.