Hacker News new | ask | show | jobs
by supo 4656 days ago
With a basic Trie, you always have this problem of how to keep the pointers at each node. Naively, you can:

1) Keep an array of size equal to the alphabet size, indexed by the characters, holding pointers to successive nodes.

- but this blows the memory to O(N * alphabet_size) where N is the input size

2) Keep a linked list of pairs (character, pointer).

- but this blows the lookup time to O(word_length * alphabet_size)

3) Keep something like an STL map<character, pointer>.

- still suboptimal complexity of O(word_length * log alphabet_size) + it complicates the code + it dramatically increases the constant time per operation

Or you research a bit more and actually learn a way to properly implement a trie, called a Ternary Search Tree: http://en.wikipedia.org/wiki/Ternary_search_tree

2 comments

Or you decide that your problem is small enough that the poor memory complexity doesn't matter much.
Sure, but it can blow up faster than you think. Say you have 100k URL strings, up to 128 chars each, with 64 allowed characters in alphabet.

- storing this at byte per char = 12.8MB

- storing this in a basic trie = 12.8 * 64 = 820MB

- storing this in the TST = 12.8 * 3 = 38MB

Sure, I was thinking on the order of <100 strings around 5-10 characters each at most. I agree that the TST is much better suited for large sets of strings. Also I think more developers should actually do the calculations even if they only have an estimate of how large the inputs will be.
Ah, in that case it goes straight to set<string> for me :-) I think that in practice you want to do that but then drop in a few asserts to check whether things are not getting out of control once you forgot about this decision.
Is this taking into account any of the memory savings that would be seen from common paths using the same memory? If a lot of entries have similar prefixes, this can lead to a lot of deduplication.
Depends on the randomness of the paths. With URLs coming from a limited set of websites - yes, there would be some savings.

With a set of random strings, for alphabet with 64 chars, there are 16M different 4 character prefixes, so the savings from overlapping prefixes for 128char long strings is likely less than 3%.