| HN Mirror

A lot of compression algorithms use fixed size dictionaries over a window. Let's take the canonical example of zlib/gzip, which does this.

With careful control over the data, you could make it so no dictionary lookup would ever succeed. This would mean no strings would ever be eliminated through backreferences.

Most also have a minimum match length, making your life a bit easier here. Most also are outputting encoded streams that basically a little decompression VM (IE literal, 0, backreference, 255 bytes ago, size 30). Because of this, they will not eliminate duplicates where the match is too small (only 2 bytes).

This will get you past the duplicate elimination phase, but not the huffman phase.

Getting past the huffman phase is harder. To get it to output a stored block, you have to convince it the raw literal length of the block will end up less than the length of the block as encoded.

For zlib, we have opt_len = (sizeof (compressed data in bits) + sizeof (huffman tree in bits ) + 3 + 7) / 8

if (stored_len+4 <= opt_lenb) use stored block

So you do have some chance to beat it by messing with the probability distributions, and do get a little leeway.

On the plus side, you only need to mess with the probability for a single zlib block, not the whole shebang.