| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bane 4546 days ago

Here's a fun example. Distribution of word length can give you insight into how deep a fast lexical lookup tree (trie) needs to be to capture most words. After that depth, you can fall back on a more memory efficient, but slower, structure like a hashtable.

In this case: at a depth of 6, a trie can handle ~75% of all words. At 5 it can handle ~67%. Since tries can grow exponentially in memory (fully populated), reducing a level and still getting about a 70% solution might be good enough. It's about an 8% reduction in the size of the representable lexicon. However if you go to length 4, you can only cover ~56% of words. Meaning there's a 45% chance that a given word won't be stored in the trie.

Supposing we set a desired metric that the trie needs to handle 70% of all words, then depth 5 is pretty reasonable and space efficient with only a 1/3 chance that a word that in our lexicon won't be in the trie.