

by OscarCunningham 2885 days ago

In fact, any lossless compression algorithm has the property that the output is (on average) at least as long as the input. The best you can hope for is an algorithm that compresses the kind of data that humans want to store, at the expense of making other data a bit longer. If you're trying to compress random data then you just can't do it.

Here's a proof: consider the strings of length n or less, and suppose there are M of them in total. Their average length is just the sum of all their lengths divided by M, and the average length of their compressed versions is just the total length of the compressed versions divided by M. Since the compression is lossless, the compressed strings must all be different.

Since there are M strings, if any of them mapped to a string of length more than n then there must be some string of length at most n not being mapped to, so the average length can be improved by instead mapping that string to the shorter string. So any optimal compression method must map only to the strings of length at most n.

So the M outputs are just the M inputs, possibly permuted. So their total length is the same, and hence their average length is the same.
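To make the counting argument concrete, here is a small brute-force sketch for binary strings with n = 4. It assumes the empty string counts as a string of length 0, and the two example codes (`permuted`, `escaped`) are hypothetical, chosen only to illustrate the two cases in the proof.

```python
from itertools import product

def all_strings(n):
    """All binary strings of length 0..n, shortest first."""
    return ["".join(bits) for k in range(n + 1) for bits in product("01", repeat=k)]

def total_length(strings):
    return sum(len(s) for s in strings)

n = 4
inputs = all_strings(n)
M = len(inputs)  # 2**(n+1) - 1 strings of length at most n

# A lossless code needs M distinct outputs. The cheapest possible choice is
# the M shortest binary strings, which are exactly the inputs themselves.
best_total = total_length(inputs)

# Hypothetical code 1: compress "0000" to "", which forces "" to go elsewhere.
# Swapping the two keeps the outputs a permutation of the inputs.
swap = {"0000": "", "": "0000"}
permuted = [swap.get(s, s) for s in inputs]

# Hypothetical code 2: send one input to a string longer than n.
# The outputs are still distinct, but the total can only get worse.
escaped = [s if s != "" else "0" * (n + 1) for s in inputs]

print(M, best_total, total_length(permuted), total_length(escaped))
```

The permuted code has the same total length as the identity, and the code that uses an output longer than n has a strictly larger total, which is what the proof claims.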

2 comments

Const-me 2885 days ago

> any lossless compression algorithm has the property that the output is (on average) at least as long as the input.

The article you’ve linked says nothing about the average. It says that for every algorithm there are at least some input files whose size increases. It even explains more about that:

> Any lossless compression algorithm that makes some files shorter must necessarily make some files longer, but it is not necessary that those files become very much longer. Most practical compression algorithms provide an "escape" facility that can turn off the normal coding for files that would become longer by being encoded. In theory, only a single additional bit is required to tell the decoder that the normal coding has been turned off for the entire input
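A minimal sketch of such an "escape" facility, assuming zlib as the "normal coding" and a whole flag byte rather than the single bit that suffices in theory. The names `pack`, `unpack`, `RAW` and `DEFLATE` are made up for this example.

```python
import os
import zlib

RAW, DEFLATE = b"\x00", b"\x01"  # hypothetical one-byte flags

def pack(data: bytes) -> bytes:
    """Compress with zlib, but fall back to storing the input verbatim."""
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return DEFLATE + compressed
    return RAW + data  # escape: normal coding turned off for this input

def unpack(blob: bytes) -> bytes:
    """Invert pack() by reading the flag byte first."""
    flag, body = blob[:1], blob[1:]
    return zlib.decompress(body) if flag == DEFLATE else body

noise = os.urandom(1000)   # random bytes: zlib will not shrink these
text = b"a" * 1000         # highly repetitive: zlib shrinks this a lot
print(len(pack(noise)), len(pack(text)))
```

With this wrapper the worst case costs exactly one extra byte, so the files that get longer do not get much longer.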


OscarCunningham 2885 days ago

Thanks. I realized this just after I posted it, so I wrote the proof into my comment instead.


aengvs 2885 days ago

>In fact, any lossless compression algorithm has the property that the output is (on average) at least as long as the input

I don't think this is true. If it were, lossless compression would be useless in a lot of applications. It's pretty easy to come up with a counterexample.

E.g.

(simple Huffman code off the top of my head, not optimal)

symbol -> code

"00" -> "0"

"01" -> "10"

"10" -> "110"

"11" -> "111"

If "00" appears 99.999% of the time, and the other 3 symbols together appear only 0.001% of the time, the output will "on average" be slightly more than half the length of the input.
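The arithmetic can be checked directly. This sketch assumes the remaining 0.001% is split evenly among the other three symbols, which the comment does not specify.

```python
# The code table from the comment above.
code = {"00": "0", "01": "10", "10": "110", "11": "111"}

# Assumption: the 0.001% of rare symbols is shared equally.
rare = 0.00001 / 3
prob = {"00": 0.99999, "01": rare, "10": rare, "11": rare}

# Expected number of output bits per 2-bit input symbol.
expected_bits = sum(prob[s] * len(code[s]) for s in code)

# Output length divided by input length, on average.
ratio = expected_bits / 2
print(expected_bits, ratio)
```

The ratio comes out just above 0.5, matching "slightly more than half".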


OscarCunningham 2885 days ago

Sure, I'm assuming that (a) you are trying to encode all strings of length at most n and (b) you have the uniform distribution over those strings. This makes sense in the original context of encoding random data.
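As a rough illustration of assumption (b), here is the same four-symbol code evaluated under a uniform distribution. Restricting to fixed 2-bit symbols is a simplification of the "all strings of length at most n" setting, made only to reuse the table from the parent comment.

```python
from fractions import Fraction

# The code table from the parent comment.
code = {"00": "0", "01": "10", "10": "110", "11": "111"}

# Uniform distribution: every 2-bit symbol is equally likely.
uniform = {s: Fraction(1, 4) for s in code}

# Expected output bits per 2-bit input symbol, computed exactly.
expected_bits = sum(uniform[s] * len(code[s]) for s in code)
print(expected_bits)  # (1 + 2 + 3 + 3) / 4
```

Under the uniform distribution the code spends 9/4 bits on every 2 input bits, so the output is longer than the input on average.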


aengvs 2885 days ago

>you have the uniform distribution over those strings. This makes sense in the original context of encoding random data.

Lossless compression is nothing more than taking advantage of prior knowledge of the distribution of the data you are compressing.

Random data isn't always (or even often) uniformly distributed. Everything we compress is "random" (in the context of information theory), so I disagree that it makes sense to assume uniformly distributed data.
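One way to quantify "prior knowledge of the distribution" is Shannon entropy, which lower-bounds the average number of bits per symbol that any lossless symbol code can achieve for an independent source. A small sketch comparing a uniform four-symbol source with a heavily skewed one; the skewed probabilities are an assumed example, not taken from any real data.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Uniform over four symbols: nothing to exploit.
uniform = [0.25] * 4

# Assumed skewed source: one symbol dominates, the rest share 0.001%.
rare = 0.00001 / 3
skewed = [0.99999, rare, rare, rare]

print(entropy(uniform), entropy(skewed))
```

The uniform source needs the full 2 bits per symbol, while the skewed source carries only a tiny fraction of a bit per symbol, which is the room a compressor has to work with.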


OscarCunningham 2885 days ago

Then the original statement about not being able to use pi as a data compression method is false. It could be the case that 99% of the time you want to encode the string "141592653".


aengvs 2885 days ago

The efficacy of a compression algorithm depends on the data it is compressing, so that statement is true for some data.


tylerhou 2885 days ago

https://en.wikipedia.org/wiki/No_free_lunch_theorem


aengvs 2885 days ago

https://en.wikipedia.org/wiki/Entropy_(information_theory)
