Hacker News new | ask | show | jobs
by quicktwo 1576 days ago
I tried this method today, but a huge shortcoming here is that a DAWG needs these large pointers between nodes.

I based my approach on http://www.wutka.com/dawg.html and http://stevehanov.ca/blog/?id=115.

I generated a DAWG with 12,822 nodes, which means you need 14 bits for each pointer. A trie representation can be packed much smaller because you don't need to randomly jump around the graph, you can just read it out sequentially.

With huffman coded labels and offsets, I got the size down to approximately:

- 94,761 bits for offsets. - 56,900 bits for labels. - 12,822 bits for indicating when you're at the end of a next chain.

= 164,483 bits + Size of Huffman Table = ~20,560 bytes

I assumed I didn't need any bits for indicating end of word, because all Wordle words are length 5.

Meanwhile, bitpacked trie can get down to 15,599 bytes.

https://github.com/adamcw/wordle-trie-packing#all-words

It's not clear to me a path that will compress the DAWG so much that it could cut another 5000 bytes and whatever the Huffman table size is.

2 comments

I feared that (“This list is a lot shorter, so there will be fewer opportunities for savings”)

However, I think you can layout the tree so that no pointers point backwards. If so, can you make those offsets smaller by making them relative to the current point in the tree?

Also, since the list only has five-letter words, for the last letter, you don’t even need the letters themselves, just 26 bits for what letters can complete a word. That might be a saving.

Also, the crab source code is available (DEC: http://www.gtoal.com/wordgames/gatekeeper/crab.sh.txt, Mac: http://www.gtoal.com/wordgames/jacobson+appel/mac/Crab_sourc.... Both via http://www.gtoal.com/wordgames/scrabble.html)

Both are nice examples of C the way it is intended to be written, or rather, was intended to be written decades ago.

I don’t remember how that stores the data, but it might do a trick you didn’t think of.

And finally, I just realize that, for fairness, you need to look at (data size + decompressor size). Did you do that?

Ah, I replied to myself with more information while you were also replying.

I also surmise that the short length of the words makes a DAWG just very heavy.

It's not clear to me that relative offsets would be notably smaller to the extent that would be needed. Even a hypothetical and cheated DAWG I came up with is ~33% bigger than alternatives. I've generally explored enough (see the paper in my other comment) that I don't feel that further investigations into a DAWG are likely to outperform other methods.

I can't see anything immediately that jumps out that the Crab game is doing that's special to save space, I think it just achieves better compression because you can compress larger files easier, and the words are longer with more overlapping sections.

I agree that you need to compare including the decompressor size, so I'm not sure which approach is better the Huffman trie or the one in the original article. I'm not familiar enough with GB programming to be able to suggest how much program memory would be needed to decode the Huffman Trie, it looks like it would be somewhat similar in complexity.

Thanks for all the informational replies.
I ended up doing some math on a theoretical DAWG, based on: https://www.cs.put.poznan.pl/dweiss/site/publications/downlo...

With 12,822 nodes, you need 57,387 bits for the labels and the Huffman table (I'm sure you could make the Huffman table more efficient, but it's only 50 bytes, so that's not helping much).

Then, to mimic their edge reordering technique but without having to actually implement all the logic, I ordered the edges by frequency and used variable length integer encoding of size 3 (this performed the best on the data set) which required 95,988 bits.

Variable length integer encoding breaks the number into 3 bit chunks, each prefixed by 1 bit to indicate if there is another 4 bit chunk to read for that number. Since the distribution of offsets is heavily skewed, optimizing the most frequent offsets into a small package is better even if rarer ones suffer from multiple prefix bits.

This is 19,171 bytes total, or substantially worse than both the original article and Huffman tries do. This isn't even counting the flag bits needed for actually traversing the graph. So even cheating, it's not clear I can get a DAWG to be within striking distance of either other approach.

I hypothesize that the reason tries and other methods perform so well here is the relatively shallow depth. All words are only length 5, so the trie doesn't ever get really deep. This also means that suffixes generally don't actually take up that much space given common ones will also pack small with Huffman coding. The size of offsets appears to be just too great relative to how much you can save by removing shared suffixes from 5 letter words.

Would love to know if there is some trick to DAWG that I'm missing that would let me get it even smaller.