Hacker News new | ask | show | jobs
by quicktwo 1577 days ago
You can get down to 15,559 bytes by combining a trie with Huffman coding: https://github.com/adamcw/wordle-trie-packing

However, this doesn't beat general Brotli encoding of a ASCII trie representation, which gets down to 14,180 bytes (but needs an experience decoder), but goes to show general purpose compression is still really really good these days.

4 comments

Roadroller [1] is probably a borderline general purpose compression algorithm, and with some automatic parameter tuning it results in 12,170 bytes estimated, at the expense of a lot of memory. "Estimated" because the algorithm was originally meant to be recompresssed in a ZIP file, so it doesn't bother to generate the smallest JS file in terms of uncompressed size (yet). But that estimation does include the decoder size so it is a good estimate for the Kolmogorov complexity of the generated code though.

[1] https://lifthrasiir.github.io/roadroller/ (the exact parameters: golf.horse dataset; input mode text; action write to document; # contexts 12 with 12,15,49,50,70,79,96,97,131,154,292,353; pollute the global scope; max memory usage 150 MB; precision 16; learning rate 1333; model max count 11; model base divisor 14; dynamic model flags -1; # abbreviations 64)

I tried this out and got 12,231 bytes Brotli encoded, and 12,493 bytes gzipped, with 16,311 bytes raw -- so the estimate is very close.

It compresses better with new lines than if you remove them all (given you could just split on every 5 characters later), which is an odd quirk of compression algorithms that my brain will never quite grasp.

My best algorithm attempt + Brotli achieved 12,773 bytes, which is a painfully close 542 bytes away. It is 13,181 bytes raw though, and can technically be used in-memory, which is definitely a perk.

> It compresses better with new lines than if you remove them all (given you could just split on every 5 characters later), which is an odd quirk of compression algorithms that my brain will never quite grasp.

New lines give a usable context (namely the word boundary) to compression algorithms. If I give you an arbitrary unsorted list of 5-letter-long words with no delimiters you need to think harder to figure out that it is indeed a list of 5-letter-long words. Same for the compression algorithm.

> My best algorithm attempt + Brotli achieved 12,773 bytes, which is a painfully close 542 bytes away. It is 13,181 bytes raw though, and can technically be used in-memory, which is definitely a perk.

Yeah, the best solution depends on what you want to do with that. Your estimation is not too far from my experience: Roadroller tends to be on par with or slightly smaller than Brotli. Of course, Roadroller exists because web browsers generally don't provide a way to use Brotli in JS ;-)

Arithmetic coding is a drop-in replacement for Huffman coding that saves binary roundoff. It's less known because it was patented, and the (short) code to implement requires optimizing a tricky 1.0000 versus 0.9999 issue (in binary).

The usual application involves letter frequencies without context. Rather than a trie for deterministic context, one could in far less space compute a hidden Markov chain of small but effective dimension, to generate the probabilities for arithmetic coding.

Thanks, I'll look into it. This case is interesting though, because the Gameboy doesn't even have native mul/div operators, so I suspect that Huffman coding is as fancy as you're going to get while still having a small and efficient decoder that isn't taking up more space than its saving.
Huffman coding was the first thing that jumped into my mind, too. Reminds me of the time we implemented a subset of bzip2 on a CS class in highschool.
brotli on the raw word list gives 17194 bytes. gzip gives 32352.

A lot can be done in 3014 bytes, but what's the difference in code size for the ascii trie vs. a flat list/gzip/brotli?

A trie representation physically removes letters from the dataset. Leaving it in ASCII means that it still leaves enough information behind that can be compressed well (a trie only exploits shared prefixes, not suffixes).