Hacker News new | ask | show | jobs
by bloaf 2169 days ago
I agree. I've seen some marvelous pigeon hole explanations of why you cannot write a general lossless compression algorithm that makes all files smaller. I have no idea what that explanation would look like in terms of maxes and averages, and I have to believe it would be much more difficult to understand.
3 comments

Yes, talking about averages sounds like it might lead a prover astray into reasoning about the average size of compressed data. The average that matters here though is how many inputs map to the same output.

Compressing (n+1)-bit input to n-bit output makes the average output correspond to 2^(n+1)/2^n = 2 inputs. By Dijkstra's rephrasing of the pigeonhole principle, the maximum number of inputs colliding on the same output must be at least 2.

Or we could make it more robust and say we're compressing to n-bit-or-smaller output. Then the average output corresponds to 2^(n+1)/(2^(n+1)-1) inputs. That average is still strictly greater than 1, so the maximum number of inputs colliding on the same output must also be strictly greater than 1 (at least 2, since that's the smallest natural which is greater than 1).

That's really more about surjectivity and simple counting. Neither the pigeon hole principle not a max/average argument are really necessary.

Say you're trying to find a way to compress every n+1 bit string into an n bit string. Losslessness means that you have to come up with a function d from n bit string to n+1 bit string that decompresses. The number of n+1 bit string is larger, so there must be n+1 bit strings that are not in the image of d, which means that such a function d cannot exist.

This is still the pigeonhole principle, just inverted - If you have more holes than pigeons (and you can’t make fractional or quantum pigeons) at least one hole must have no pigeons.
It's similar, but that's not usually referred to as the pigeon hole principle.
It's more than similar, it's gone from talking about the necessary properties of a function with an image smaller than its domain, to the impossibility of an inverse function as the image cannot be larger than its domain.

We might as well call this "the whole-pigeon principle."

A "file" on a computer is just a number.

When you compress a file, you are representing a big number with a small number.

There are more big numbers than there are small numbers.

You've just expressed part of the problem with numbers, you haven't tied them into averages and minimums/maximums.
Why is that required?
The prompt you were responding to was "I have no idea what that explanation would look like in terms of maxes and averages, and I have to believe it would be much more difficult to understand."

Specifically, they were looking for how you write the can't-always-compress proof using formulation (1)

Oh right. I read it as "how else could you explain it simply". I always thought the pigeon hole principal was a needless metaphor.