Hacker News new | ask | show | jobs
by Robin_Message 5779 days ago
Good luck building a Unicode trie -- the branching factor would be too high, never mind lookup time. Instead, you'd make the trie of an encoding, probably UTF-8 (off the top of my head) that would enable you to keep the branching factor at 256, which is already rather large but doable (You can switch to Judy arrays if the wasted space bothers you.)

Does anyone know, is there a Unicode encoding that enables you to map arbitrary ranges (so I can, for example, use the greek alphabet only at 1 byte per character or less)? I suppose UTF-8 is already hard enough to decode.

3 comments

Ternary search tries attempt to solve this problem (among others):

http://en.wikipedia.org/wiki/Ternary_search_tree

http://www.strchr.com/ternary_dags

Or you use a sparse data structure instead of an array in each node, for example a binary tree. This gives you ternary trees.
> Instead, you'd make the trie of an encoding, probably UTF-8

There are some new implications you need to account for if you do this, as you no longer have one node per letter, but a path per letter. E.g. subtrees will no longer correctly represent substrings, making autocompletion slightly trickier.