Hacker News new | ask | show | jobs
by rchrch 1587 days ago
Did you compare this to compression with a trie?
2 comments

I imagine only having to store 5 bits for the differences accounts for a large swath of the savings. With the first letter bucketing the rest of the words in a sort of trie-like behaviour, I wonder if there's enough duplication of successive characters for a full trie to be worth it. After all, there would need to be extra bits used to store leaf/node identity, as well as whether there are _more_ nodes past the current leaf. I suppose instead of using markers to denote ends, the parent node could store a count, but that might be wasteful for buckets at the fourth character.

I'm very curious about this.

See step 1; the first level is stored as a trie. Then, see Notes, where it's stated that trie-ifying level 2 requires more space.