Hacker News new | ask | show | jobs
by NobodyNada 1038 days ago
Every lossless “compression” algorithm makes some strings shorter and other strings longer, such that the average space savings against all possible strings is neutral at best. This is a fundamental consequence of the way information works — there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

The goal of compression, then, is to make common strings shorter while making uncommon strings longer. You can think of, say, UTF-8 as a simple compression algorithm: it would take 20 bits per character to encode all Unicode code points with a fixed-width encoding, but UTF-8 takes advantage of the fact that the most commonly used characters have a lot of leading zeroes (at least in English). So the characters of the English alphabet require only 8 bits to encode, but uncommon characters require up to 32 bits.

Thus, I would expect an LLM-based compression algorithm to do well on strings that were common in its training data, and make strings that were uncommon or absent slightly longer. If it did not do that, it would not be a lossless compression algorithm.

3 comments

I don't disagree with anything you said.

I'm saying more that if the compression algorithm is benchmarking against "Alice in Wonderland" and has consumed the entirety of "Alice in Wonderland" in training the LLM (along with popular paragraphs and sentences quoted elsewhere), then it might do particularly well at reciting lines from that book and thus be able to compress it extremely well. I'd be more interested in seeing the compression algorithm's performance on new or unreleased works that would have no way of being training data.

As an extreme hypothetical, I could make a compression algorithm that is a table mapping an ID to an entire book and fill it with all the popular works. "Alice in Wonderland" would be replaced with a single short identifier string and achieve a ~0.001% compression ratio. An unseen work would be replaced with an <unknown> ID followed by the entire work and be slightly bigger. Then, I benchmark only the popular works and show insanely impressive results!

I have no doubt the LLM compressor would do really well on unseen works based on what you said above, but it's not a fair look at its performance to run it on works it may have been explicitly trained on.

Well, unless the weights are part of the compressed file.
Gotcha, sorry I completely misunderstood what you were asking! That’s a really insightful question that didn’t cross my mind at all.
> there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

This is due to the pigeonhole principle

>Every lossless “compression” algorithm makes some strings shorter and other strings longer, such that the average space savings against all possible strings is neutral at best. This is a fundamental consequence of the way information works — there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

I think I'm missing a point here. Is it a provable fact that all lossless compression algorithms are neutral at best? Or just something that occurs in the real world?

I can't think of a good example, so I'll give a trivial example to explain my thinking.

It seems to me that you could have an algorithm which did not change any text, except the sentence "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo", which it encodes as something much shorter like "<Buffalo*8>". The algorithm would be (very marginally) better than neutral. If this is true, then surely other improvements would be possible to make a lossless algorithm with real reduction.

(Edited a typo)

The impossibility of a general lossless compression algorithm is a consequence of the pigeon-hole principle.

https://en.m.wikipedia.org/wiki/Pigeonhole_principle

Then how do you encode an input of "<Buffalo*8>"? If your answer is "I'm using characters outside the alphabet" then, there's your answer, you're wasting bits on them.