Hacker News new | ask | show | jobs
by bane 4048 days ago
Tries are amazing data structures, simple and extraordinarily fast -- O(m) on look ups. But they also eat memory at extraordinary rates as well. They're a classic speed vs. memory data structure.

However, most people use naive Tries, just adding elements down a branch until they exhaust the string they're inserting.

One easy optimization to make with Tries is to set a maximum branch length (based on some statistical analysis of lexeme usage. For example, make 90% of your lookups reachable under that length), any lexeme longer than that length simply gets hung off of the end of the branch in a more space-efficient data structure (like a hash table).

Your lookup then is then still O(m) for anything under the maximum branch length, and things longer are still just O(m)+O(n) or whatever.

But your memory usage will shrink dramatically. And you can improve it by fiddling with your branch length and choose say 80% reachable without hashing.

3 comments

Indeed there are a lot of good optimization options for tries.

At a little bit of a speed penalty you can use a bitmap for navigation instead of pointers, which obliterates the memory footprint. In some use cases the result is actually smaller than the equivalent Bloom filter with a sane FP rate (though generally not smaller than a Golomb-compressed set or cuckoo filter).

A compromise between the speed and memory bloat of pointers is to use Radix/Patricia tries, which are a solid go-to option. In some cases they'll be slower than naive tries, but in others they'll be faster.

The fastest and most memory-efficient implementation I've seen, though, boiled each pattern in the trie down to its most unique dword or qword, and only attempted a full match after an initial hit. Locality of reference can work wonders here -- we got a 20x speedup over an already very fast naive trie this way.

> boiled each pattern in the trie down to its most unique dword or qword, and only attempted a full match after an initial hit

Oh that's interesting - like a very fast cmp before bothering to check the rest of the branches.

When you say pattern are you talking about specific letters/morphemes or subsequence? And were you pulling the [dq]words from the initial pattern or generating them some otherway (through some kind of hash or encoding function?).

Sorry, should've given more context. In this case the trie was for Aho-Corasick, so we built it from the most unique subsequences. Since there was no need to add or remove patterns at runtime, the subsequences were precomputed and the trie built server-side, then sent serialized down to the application.

On a different occasion I tried a few kinds of entropy coding, but their value was pretty limited, and obviously not applicable if you need to operate on a stream with Aho. In that case I only needed memory-optimized set membership testing, though, so ended up going with a Golomb-compressed set from pgs. 7-8 of [0].

[0] http://algo2.iti.kit.edu/singler/publications/cacheefficient...

They can eat memory at extraordinary rates, but I've successfully used them to reduce memory usage as well. If you need to store a mess of strings that tend to have a lot of duplication toward the front - URIs, for example - then a trie might fare pretty well in that department.
A DAWG can be even better, at the expense of considerable setup cost.

Though a naive Trie can be converted to a DAWG incrementally and on-the-fly relatively easily. If you're careful, you can even do this on another thread in the background.

One thing about tries that's really nice is that you can transparently combine nodes to turn it into a DAWG.

Especially as you can do it on-the-fly. Memory usage is getting excessive? Stop and do a suffix combination pass until it's decent again. Otherwise? Don't bother.