Hacker News new | ask | show | jobs
by quicktwo 1574 days ago
I ended up doing some math on a theoretical DAWG, based on: https://www.cs.put.poznan.pl/dweiss/site/publications/downlo...

With 12,822 nodes, you need 57,387 bits for the labels and the Huffman table (I'm sure you could make the Huffman table more efficient, but it's only 50 bytes, so that's not helping much).

Then, to mimic their edge reordering technique but without having to actually implement all the logic, I ordered the edges by frequency and used variable length integer encoding of size 3 (this performed the best on the data set) which required 95,988 bits.

Variable length integer encoding breaks the number into 3 bit chunks, each prefixed by 1 bit to indicate if there is another 4 bit chunk to read for that number. Since the distribution of offsets is heavily skewed, optimizing the most frequent offsets into a small package is better even if rarer ones suffer from multiple prefix bits.

This is 19,171 bytes total, or substantially worse than both the original article and Huffman tries do. This isn't even counting the flag bits needed for actually traversing the graph. So even cheating, it's not clear I can get a DAWG to be within striking distance of either other approach.

I hypothesize that the reason tries and other methods perform so well here is the relatively shallow depth. All words are only length 5, so the trie doesn't ever get really deep. This also means that suffixes generally don't actually take up that much space given common ones will also pack small with Huffman coding. The size of offsets appears to be just too great relative to how much you can save by removing shared suffixes from 5 letter words.

Would love to know if there is some trick to DAWG that I'm missing that would let me get it even smaller.