Why is this an invalid approach? If you have an agreed-upon OS installation, you've got at least several hundred megabytes of binary data from which you can extract sequences for a lookup dictionary. To compress, you locate largest substrings of the input that also appear somewhere in the OS data. For example, you might find that bytes 1000-2000 of the input data are the same as bytes 4528-5528 of /bin/ls. Then you just need a program that can extract this byte range from /bin/ls and fill it in at the appropriate place.
Of course, it isn't a given that you will be able to find a sufficiently large overlapping substring in the OS binary data. And it may not be allowed to assume an agreed-upon OS installation, although it was OK to assume that gunzip was available. And finally, doing this sort of largest-overlapping-substring search could be very very slow.
The real key here is that the challenge is about compressing one specific random file. It's clearly impossible to write a compressor that can represent ANY sequence of N bytes in fewer than N bytes including the decompressor. But this challenge is constrained to reconstructing a single input file, not handling arbitrary inputs. If the input files are random, and you keep asking for them, there's a tiny chance you'll get one with enough repetition that even gzip could compress it. If you can key against a large set of reference data, your odds improve significantly.
So, I don't know if this is a practical approach given the restrictions of the contest, the cost of entering, and the expected payoff, but I would say that if it is plausible to use a large body of OS files for lookup, this could be a winnable challenge.
Because of information theory. It's believed that the binary expansion of pi contains all possible finite bit sequences. A program that expands pi is relatively small. Assuming that hypothesis is true, does that mean we could write a compressor that simply breaks input into blocks and then encodes the output as indexes into pi?
And the answer is that we certainly could write that program. And the result would almost always be a larger output than input. Why? Because you'd have to search so far into pi that the index would contain more bits than the input blocks. In short, having a library of arbitrary strings to draw from does not help you to compress data.
More generally, compression is impossible in the general case. Every lossless compression algorithm will, on average, produce an equal or larger output than its input.
Of course, it isn't a given that you will be able to find a sufficiently large overlapping substring in the OS binary data.
It seems that it's a given that you will not find that substring. The larger it is, the less likely you will find it and the less likely it will repeat in the uncompressed file.
Unlikely but not impossible. Note that the substring does NOT need to repeat in the uncompressed file; it just needs to be possible to say "copy bytes X through Y of /bin/ls here" in fewer than Y-X bytes. I'd love to know what the odds really are, but it seems like you have some good options for improving them (e.g. ask for a really huge input file).
> it just needs to be possible to say "copy bytes X through Y of /bin/ls here" in fewer than Y-X bytes
It is very unlikely that the length of the sequence will be smaller than the space required to store X. Your plan is similar to saying "The input could contain long repeated segments containing only zeroes. Compress those using scheme Z. The longer the file, the more likely this is to occur.". Information theory allows us to prove that this cannot always work. In fact it is very unlikely to work for a specific randomly generated target file.
So by saying 'cannot _always_ work,' and 'very unlikely to work,' you are implicitly conceding the point- that it is possible. The question then becomes, given the size of the OS environment space (and the quality of the randomness in the originals), could one in 50 attempts actually succeed in shaving off a byte? I sincerely doubt that Mike, despite his superior knowledge of information theory, actually took the time to figure out just how good or bad the odds were. He assumed that it was fundamentally impossible, when it is in fact not impossible, just unlikely (to some unknown [to me] degree).
Let me just put it this way. It's more likely that you could win using gzip. It's similar to saying it's "possible" that the target file could have contained all zeroes.
Assuming the target file is competently constructed, the chances of winning wouldn't make it a rational bet for $0.01
On average, even if you have the specific file to work with, encoding those numbers, namely 1000 to 2000, will cost at least as much space as the lengths of the sequence you find. So it would be possible to do this, but the sequences would be so short, and offsets so large, that you'd end up losing.
One way to structure a compressed file is as a lookup dictionary. To decompress the file, you'll need to agree on which file is to be decompressed. If you need to have an agreed OS installation to act as a lookup dictionary before decompression can happen, by all rights you should include the size of the OS in your total of how large the compressed file is.
If you want to use an arbitrary OS, the "decompresser" would need to be able to identify the correct fragments. If you can do that, in less space than the original files, you've already managed to compress the data!
Of course, it isn't a given that you will be able to find a sufficiently large overlapping substring in the OS binary data. And it may not be allowed to assume an agreed-upon OS installation, although it was OK to assume that gunzip was available. And finally, doing this sort of largest-overlapping-substring search could be very very slow.
The real key here is that the challenge is about compressing one specific random file. It's clearly impossible to write a compressor that can represent ANY sequence of N bytes in fewer than N bytes including the decompressor. But this challenge is constrained to reconstructing a single input file, not handling arbitrary inputs. If the input files are random, and you keep asking for them, there's a tiny chance you'll get one with enough repetition that even gzip could compress it. If you can key against a large set of reference data, your odds improve significantly.
So, I don't know if this is a practical approach given the restrictions of the contest, the cost of entering, and the expected payoff, but I would say that if it is plausible to use a large body of OS files for lookup, this could be a winnable challenge.