From http://prize.hutter1.net/:

> Being able to compress well is closely related to intelligence as explained below. While intelligence is a slippery concept, file sizes are hard numbers. Wikipedia is an extensive snapshot of Human Knowledge. If you can compress the first 1GB of Wikipedia better than your predecessors, your (de)compressor likely has to be smart(er). The intention of this prize is to encourage development of intelligent compressors/programs as a path to AGI.
>
> The Task: Losslessly compress the 1GB file enwik9 to less than 114MB. More precisely:
>
> - Create a Linux or Windows compressor comp.exe of size S1 that compresses enwik9 to archive.exe of size S2 such that S:=S1+S2 < L := 114'156'155 (previous record).
> - If run, archive.exe produces (without input from other sources) a 10^9 byte file that is identical to enwik9.
> - If we can verify your claim, you are eligible for a prize of 500'000€×(1-S/L). Minimum claim is 5'000€ (1% improvement).
> - Restrictions: Must run in ≲50 hours using a single CPU core and <10GB RAM and <100GB HDD on our test machine.
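To make the payout rule concrete, here is a minimal Python sketch of the formula quoted above. The function name `prize_eur` is mine, and treating anything below a 1% improvement as ineligible is my reading of the "minimum claim" sentence, not something the rules spell out.

```python
L = 114_156_155        # previous record in bytes, from the quoted rules
PRIZE_POOL = 500_000   # euros, from the quoted rules
MIN_IMPROVEMENT = 0.01 # assumption: below 1% improvement nothing is paid

def prize_eur(s1: int, s2: int) -> float:
    """Payout in euros for compressor size s1 and archive size s2 (bytes)."""
    s = s1 + s2                  # S := S1 + S2
    improvement = (L - s) / L    # same as 1 - S/L
    if improvement < MIN_IMPROVEMENT:
        return 0.0               # not a valid claim under this reading
    return PRIZE_POOL * improvement

# Hypothetical entry: 100 kB compressor, 112 MB self-extracting archive.
print(round(prize_eur(100_000, 112_000_000), 2))
```

Note that every byte of comp.exe lowers the payout exactly as much as a byte of archive.exe does.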
That criterion is rather more complicated than just taking the size S2 of archive.exe. I can see no logical meaning in the sum of the size of a compressor and the size of its output.
Disregarding the compressor size would make this contest easier to understand: it would simply be an attempt to bound the Kolmogorov complexity (i.e. information content) of enwik9 from above. I looked for the motivation for including the compressor size and only found this in the FAQ:
> By just measuring L(D)+L(A), one can freely hand-craft large word tables (or other structures) used by C and D, and place them in C and either D or A. By counting both, L(C) and L(D), such tables become 2-3 times more expensive, and hence discourages them.
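As I read the FAQ argument, it is plain bookkeeping: a table needed by both sides is counted once if only the decompressor and archive are measured, and twice if the compressor is measured as well. A minimal sketch of that accounting follows; the function name and all byte counts are made up for illustration, and it only models the factor of 2, not the "3".

```python
def scored_size(core_c: int, core_d: int, payload: int,
                table: int, count_compressor: bool) -> int:
    """Total scored bytes for a hypothetical entry.

    core_c / core_d: compressor / decompressor code without the table
    payload:         compressed data A without the table
    table:           hand-crafted word table needed by both C and D
    """
    # The decompressor side always carries one copy (inside D or A).
    size = core_d + payload + table
    if count_compressor:
        # The compressor needs its own copy, so the table is paid for twice.
        size += core_c + table
    return size

# Marginal cost of a 50-byte table under each measure (made-up numbers).
for counted in (False, True):
    with_table = scored_size(100, 100, 1000, 50, counted)
    without_table = scored_size(100, 100, 1000, 0, counted)
    print(counted, with_table - without_table)
```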
Discouraging word tables seems like a weak and somewhat arbitrary justification for complicating the measure of merit. I don't think the contest would be any less interesting if the nature of the compression were disregarded. One could even argue that it is justified for compression of Wikipedia to take far more resources than decompression, since compression has to be performed only once while its result could be decompressed millions of times.