Hacker News new | ask | show | jobs
by praptak 5782 days ago
"First, lookup time is O(1) in the size of the trie."

The article skims over the tricky part - the dependency on the size of the alphabet. Which you can no longer treat as insignificant in the days of unicode.

2 comments

Good luck building a Unicode trie -- the branching factor would be too high, never mind lookup time. Instead, you'd make the trie of an encoding, probably UTF-8 (off the top of my head) that would enable you to keep the branching factor at 256, which is already rather large but doable (You can switch to Judy arrays if the wasted space bothers you.)

Does anyone know, is there a Unicode encoding that enables you to map arbitrary ranges (so I can, for example, use the greek alphabet only at 1 byte per character or less)? I suppose UTF-8 is already hard enough to decode.

Ternary search tries attempt to solve this problem (among others):

http://en.wikipedia.org/wiki/Ternary_search_tree

http://www.strchr.com/ternary_dags

Or you use a sparse data structure instead of an array in each node, for example a binary tree. This gives you ternary trees.
> Instead, you'd make the trie of an encoding, probably UTF-8

There are some new implications you need to account for if you do this, as you no longer have one node per letter, but a path per letter. E.g. subtrees will no longer correctly represent substrings, making autocompletion slightly trickier.

Well, the version of this I did in C for the shop I was at about 10 years ago had a "memcmp"-like interface. Just pass in a pointer to the start of the bytes (int, asciiz, unicode if you will) and the length.

I also dynamically allocated the key/pointer arrays within each node so that while it was sparsely populated, it was only big enough to hold the largest byte defined in that node (e.g. - 'A' = (char) 65, so byte positions 0 to 65 would be present, but not 66 to 255 until needed.

I was nice to see the impression of my coworkers when a bunch of qsort() / bsearch() code was replaced with this. We could afford the memory, and the speedup was fairly impressive.