| I tried this method today, but a huge shortcoming here is that a DAWG needs these large pointers between nodes. I based my approach on http://www.wutka.com/dawg.html and http://stevehanov.ca/blog/?id=115. I generated a DAWG with 12,822 nodes, which means you need 14 bits for each pointer. A trie representation can be packed much smaller because you don't need to randomly jump around the graph, you can just read it out sequentially. With huffman coded labels and offsets, I got the size down to approximately: - 94,761 bits for offsets.
- 56,900 bits for labels.
- 12,822 bits for indicating when you're at the end of a next chain. = 164,483 bits + Size of Huffman Table
= ~20,560 bytes I assumed I didn't need any bits for indicating end of word, because all Wordle words are length 5. Meanwhile, bitpacked trie can get down to 15,599 bytes. https://github.com/adamcw/wordle-trie-packing#all-words It's not clear to me a path that will compress the DAWG so much that it could cut another 5000 bytes and whatever the Huffman table size is. |
However, I think you can layout the tree so that no pointers point backwards. If so, can you make those offsets smaller by making them relative to the current point in the tree?
Also, since the list only has five-letter words, for the last letter, you don’t even need the letters themselves, just 26 bits for what letters can complete a word. That might be a saving.
Also, the crab source code is available (DEC: http://www.gtoal.com/wordgames/gatekeeper/crab.sh.txt, Mac: http://www.gtoal.com/wordgames/jacobson+appel/mac/Crab_sourc.... Both via http://www.gtoal.com/wordgames/scrabble.html)
Both are nice examples of C the way it is intended to be written, or rather, was intended to be written decades ago.
I don’t remember how that stores the data, but it might do a trick you didn’t think of.
And finally, I just realize that, for fairness, you need to look at (data size + decompressor size). Did you do that?