Hacker News new | ask | show | jobs
by forgotusername 4982 days ago
> but if you can put in any data you want, you could poison the compression enough to put it into a bad state -- one where effectively nothing compresses properly, and you end up with your own data completely in the clear.

This sounds interesting, and completely counter to my understanding of how compression works.. I thought CRIME's innovation was to exploit compression ratios as a proxy for cleartext. Poking a compressor into emitting probabilistically assigned bit-aligned symbols that happen to correspond to its input seems unlikely.

2 comments

A lot of compression algorithms use fixed size dictionaries over a window. Let's take the canonical example of zlib/gzip, which does this.

With careful control over the data, you could make it so no dictionary lookup would ever succeed. This would mean no strings would ever be eliminated through backreferences.

Most also have a minimum match length, making your life a bit easier here. Most also are outputting encoded streams that basically a little decompression VM (IE literal, 0, backreference, 255 bytes ago, size 30). Because of this, they will not eliminate duplicates where the match is too small (only 2 bytes).

This will get you past the duplicate elimination phase, but not the huffman phase.

Getting past the huffman phase is harder. To get it to output a stored block, you have to convince it the raw literal length of the block will end up less than the length of the block as encoded.

For zlib, we have opt_len = (sizeof (compressed data in bits) + sizeof (huffman tree in bits ) + 3 + 7) / 8

if (stored_len+4 <= opt_lenb) use stored block

So you do have some chance to beat it by messing with the probability distributions, and do get a little leeway.

On the plus side, you only need to mess with the probability for a single zlib block, not the whole shebang.

If you can put arbitrary content into data to be compressed and don't have any real length limits (<1MB or so, probably) then skewing the probabilities to have the compressor spit out the data you want is pretty straightforward. I'll write a blog post on that at some point, as it's a technique I abuse in a couple places.