|
Why is this an invalid approach? If you have an agreed-upon OS installation, you've got at least several hundred megabytes of binary data from which you can extract sequences for a lookup dictionary. To compress, you locate largest substrings of the input that also appear somewhere in the OS data. For example, you might find that bytes 1000-2000 of the input data are the same as bytes 4528-5528 of /bin/ls. Then you just need a program that can extract this byte range from /bin/ls and fill it in at the appropriate place. Of course, it isn't a given that you will be able to find a sufficiently large overlapping substring in the OS binary data. And it may not be allowed to assume an agreed-upon OS installation, although it was OK to assume that gunzip was available. And finally, doing this sort of largest-overlapping-substring search could be very very slow. The real key here is that the challenge is about compressing one specific random file. It's clearly impossible to write a compressor that can represent ANY sequence of N bytes in fewer than N bytes including the decompressor. But this challenge is constrained to reconstructing a single input file, not handling arbitrary inputs. If the input files are random, and you keep asking for them, there's a tiny chance you'll get one with enough repetition that even gzip could compress it. If you can key against a large set of reference data, your odds improve significantly. So, I don't know if this is a practical approach given the restrictions of the contest, the cost of entering, and the expected payoff, but I would say that if it is plausible to use a large body of OS files for lookup, this could be a winnable challenge. |
Because of information theory. It's believed that the binary expansion of pi contains all possible finite bit sequences. A program that expands pi is relatively small. Assuming that hypothesis is true, does that mean we could write a compressor that simply breaks input into blocks and then encodes the output as indexes into pi?
And the answer is that we certainly could write that program. And the result would almost always be a larger output than input. Why? Because you'd have to search so far into pi that the index would contain more bits than the input blocks. In short, having a library of arbitrary strings to draw from does not help you to compress data.
More generally, compression is impossible in the general case. Every lossless compression algorithm will, on average, produce an equal or larger output than its input.