Hacker News new | ask | show | jobs
by pitdicker 903 days ago
From http://prize.hutter1.net/:

Being able to compress well is closely related to intelligence as explained below. While intelligence is a slippery concept, file sizes are hard numbers. Wikipedia is an extensive snapshot of Human Knowledge. If you can compress the first 1GB of Wikipedia better than your predecessors, your (de)compressor likely has to be smart(er). The intention of this prize is to encourage development of intelligent compressors/programs as a path to AGI.

The Task: Losslessly compress the 1GB file enwik9 to less than 114MB. More precisely:

- Create a Linux or Windows compressor comp.exe of size S1 that compresses enwik9 to archive.exe of size S2 such that S:=S1+S2 < L := 114'156'155 (previous record).

- If run, archive.exe produces (without input from other sources) a 10^9 byte file that is identical to enwik9.

- If we can verify your claim, you are eligible for a prize of 500'000€×(1-S/L). Minimum claim is 5'000€ (1% improvement).

- Restrictions: Must run in ≲50 hours using a single CPU core and <10GB RAM and <100GB HDD on our test machine.

2 comments

> compressor comp.exe of size S1 that compresses enwik9 to archive.exe of size S2 such that S:=S1+S2 < L

That criterion is rather more complicated than just taking the size S2 of archive.exe There is no logical meaning to the sum of the size of a compressor and its output that I can see.

Disregarding the compressor size would make this contest easier to understand as simply trying to determine the Kolmogorov Complexity (i.e. information content) of enwik9. I looked for the motivation of including compressor size and only found this in the FAQ:

> By just measuring L(D)+L(A), one can freely hand-craft large word tables (or other structures) used by C and D, and place them in C and either D or A. By counting both, L(C) and L(D), such tables become 2-3 times more expensive, and hence discourages them.

Discouraging word tables seems like a weak and somewhat arbitrary justification for complicating the measure of merit. I don't think the contest would be any less interesting if the nature of the compression would be disregarded. One could even argue that compression of Wikipedia taking way more resources than decompression is justified by having to perform it only once, while its result could be decompressed millions of times.

I can't follow. If you'd disregard the size of the decompressor, you could encode as much data as you want in there, e.g. just assign a number to each article and include the corresponding text in the decompressor, or assign a number to each sentence occuring in Wikipedia, just dump the 1GB of data in there verbatim, etc...

That would make the size of the compressed file meaningless.

Disregarding the time complexity of the compression is interesting, even ignoring the space complexity during compression or the size of the compressor.

But the size of the decompressor can't be ignored when trying to "measure" Kolgomorov complexity of the source data.

> I can't follow. If you'd disregard the size of the decompressor

GP is talking about excluding the size of the compressor not excluding the size of the decompressor which obviously has to be included.

> There is no logical meaning to the sum of the size of a compressor and its output that I can see.

There is if you have a time limit. Otherwise you could spend a week computing a better result and embed that directly into the compressor - for use only if you're competing that Wikipedia file.

Why is a time limit needed?

And it doesn't matter if all of Wikipedia is embedded in the compressor — if you can reconstruct the original using only the decompressor and the compressed data, the process can be as expensive as you like.

That's the exciting part.

It's also a vague connection to what's exciting about LLMs, regarding the amount of information they can reproduce with comparatively small size of their weights. But since decompression must be lossless, it's unclear to me if this approach could really help here.

Edit: all of this is already addressed in the FAQ at http://prize.hutter1.net/

Especially the paragraph "Why lossless compression?" is interesting to read.

> And it doesn't matter if all of Wikipedia is embedded in the compressor

That's not what I said. I said you can embed a better result. As in, make an algorithm that performs ok in general case, but bake in a super-optimised result that specifically applies to Wikipedia, even if it takes months to precompute.

> Why is a time limit needed?

Would you like to judge my compressor against other ones? It will take 2000 years to get the result, but it's totally worth it, I promise. It's a random search of optimal binaries that produce the Wikipedia and it will stop once I'm over the winning threshold.

Thanks for replying. I see a bit more of your point now I guess.

I still tend to consider the needed complexity to produce that binary as irrelevant, because the challenge is not interested in practical compression, only in space complexity and the relation between information entropy and intelligence, no?

If such a highly-specialized binary could be theoritically produced for other corpuses of the same size as well, that would be very interesting! Regardless of how practical their computation is.

And if it's not possible for othe corpuses, the complexity has to be somehow embedded in the decompressor, not the compressor, or am I missing something? Only alternative would be that the compression exploits some data inherent to the machine architecture that is not part of the executable. Still, this only shifts the problem around.

The idea reminds of an optimized ordering of Gödel numbers.

The pidgeonhole principle limits what a general lossless compressor can do, but we don't know how the minimum size of a compressed file scales with the output size, because this problem is related to the Halting problem. Even if we assume infinite compute, we can't know the exact limits for calculating a function |c(n)| that specificies the minimum length of a program that computes an arbitrary output of size |n|. If I'm not totally ignorant and lacking information here.

>since decompression must be lossless, it's unclear to me if this approach could really help here.

All lossy compression can be made lossless by storing a residual. A modern language transformer is well suited to compression because they output a ranked list of probabilities for every token, which is straightforward to feed into an entropy encoder.[0]

On the other hand, LLMs expose the flaw in the Hutter prize - Wikipedia is just too damn small, and the ratio of "relevant knowledge" to "incidental syntactic noise" too poor, for the overhead of putting "all human knowledge" in the encoder to be worth it. A very simple language model can achieve very good compression already, predicting the next word almost as well as a human can in absolute terms. The difference between "phone autocomplete" and "plausible AGI" is far into the long tail of text prediction accuracy.

Probably a state of the art Hutter prize winning strategy would simply be to train a language model on Wikipedia only, with some carefully optimal selection of parameter count. The resulting model would be useless for anything except compressing Wikipedia though, due to overfit.

[0] A practical difficulty is that large language models often struggle with determinism even disregarding the random sampling used for ChatGPT etc, when large numbers of floating point calculations are sharded across a GPU and performed in an arbitrary order. But this can easily be overcome at the expense of efficiency.

> On the other hand, LLMs expose the flaw in the Hutter prize

I'm not sure why it would be a flaw — isn't this more a sign of how universally interestint the rules are?

Out of curiosity, I didn't dive deeply into the previous winner's implementation, but are there NNs (+ error correction matrix, that's why I mentioned the FAQ in other comment) among them?

Regurgitating something verbatim seems less like intelligence vs summarizing the important details. Additionally, a human can even choose the requirements for lossy compression in the first place - do I want a 1 sentence summary of what I read, a 1 paragraph summary, a synopsis, a description of the themes, how the themes relate to other works by the author/creator, etc etc.

So I fail to see the fundamental connection between intelligence and lossless compression unless you just choose to define intelligence in some way that connects it to lossless compression.