|
|
|
|
|
by bane
4048 days ago
|
|
Tries are amazing data structures, simple and extraordinarily fast -- O(m) on look ups. But they also eat memory at extraordinary rates as well. They're a classic speed vs. memory data structure. However, most people use naive Tries, just adding elements down a branch until they exhaust the string they're inserting. One easy optimization to make with Tries is to set a maximum branch length (based on some statistical analysis of lexeme usage. For example, make 90% of your lookups reachable under that length), any lexeme longer than that length simply gets hung off of the end of the branch in a more space-efficient data structure (like a hash table). Your lookup then is then still O(m) for anything under the maximum branch length, and things longer are still just O(m)+O(n) or whatever. But your memory usage will shrink dramatically. And you can improve it by fiddling with your branch length and choose say 80% reachable without hashing. |
|
At a little bit of a speed penalty you can use a bitmap for navigation instead of pointers, which obliterates the memory footprint. In some use cases the result is actually smaller than the equivalent Bloom filter with a sane FP rate (though generally not smaller than a Golomb-compressed set or cuckoo filter).
A compromise between the speed and memory bloat of pointers is to use Radix/Patricia tries, which are a solid go-to option. In some cases they'll be slower than naive tries, but in others they'll be faster.
The fastest and most memory-efficient implementation I've seen, though, boiled each pattern in the trie down to its most unique dword or qword, and only attempted a full match after an initial hit. Locality of reference can work wonders here -- we got a 20x speedup over an already very fast naive trie this way.