|
|
|
|
|
by NobodyNada
1038 days ago
|
|
Every lossless “compression” algorithm makes some strings shorter and other strings longer, such that the average space savings against all possible strings is neutral at best. This is a fundamental consequence of the way information works — there’s no way to losslessly map (say) 8 bits of input into 7 bits of output. The goal of compression, then, is to make common strings shorter while making uncommon strings longer. You can think of, say, UTF-8 as a simple compression algorithm: it would take 20 bits per character to encode all Unicode code points with a fixed-width encoding, but UTF-8 takes advantage of the fact that the most commonly used characters have a lot of leading zeroes (at least in English). So the characters of the English alphabet require only 8 bits to encode, but uncommon characters require up to 32 bits. Thus, I would expect an LLM-based compression algorithm to do well on strings that were common in its training data, and make strings that were uncommon or absent slightly longer. If it did not do that, it would not be a lossless compression algorithm. |
|
I'm saying more that if the compression algorithm is benchmarking against "Alice in Wonderland" and has consumed the entirety of "Alice in Wonderland" in training the LLM (along with popular paragraphs and sentences quoted elsewhere), then it might do particularly well at reciting lines from that book and thus be able to compress it extremely well. I'd be more interested in seeing the compression algorithm's performance on new or unreleased works that would have no way of being training data.
As an extreme hypothetical, I could make a compression algorithm that is a table mapping an ID to an entire book and fill it with all the popular works. "Alice in Wonderland" would be replaced with a single short identifier string and achieve a ~0.001% compression ratio. An unseen work would be replaced with an <unknown> ID followed by the entire work and be slightly bigger. Then, I benchmark only the popular works and show insanely impressive results!
I have no doubt the LLM compressor would do really well on unseen works based on what you said above, but it's not a fair look at its performance to run it on works it may have been explicitly trained on.