Hacker News new | ask | show | jobs
by leif 4656 days ago
Tries (actually pronounced "trees" or "Patricia trees") are very memory efficient compared to most alternatives, but really only shine in membership query workloads. If you actually need to return the stored element, you have to rebuild it for every query, which is probably too expensive if you have lots of queries, it would be better to store the whole object continuously somewhere in the data structure so you could return a reference to it.

Even then, tries only win if the distribution is suitable to give you the memory efficiency you are hoping for. If you don't have many common prefixes, you're better off with just a hashtable.

5 comments

If your key type is lexically sorted, you can also do prefix searches in a binary search tree.
Prefix searches and range queries in a large corpus with a lot of shared prefixes - a file system listing cache - have been my primary reason for using tries.

You can answer questions like "how many files does this directory have, immediately in it" and also "under it recursively" relatively quickly, and if you use a Patricia trie with a little cleverness in the compression of the text, it needn't take up much space at all even for millions of paths. Whereas a binary tree isn't so great at the memory saving bit.

Yes, I used that. It's what the ceiling() method in the binary tree in the article does, it looks for the prefix. Still, it's slower :-)
I have reported the "try" pronunciation because that's the way I have learnt them, and also because that's how Sedgewick's book explicitly uses, but I guess there are alternatives :-)

Anyway, you just got right the point that I was trying to make, a trie is definitely not a general purpouse data structure, but for the specific case (limited number of elements, many common prefixes etc) it performs very well. As for the point of returning the original object, it suffices to store the original object in end-word nodes. I didn't do that in this code because it would have added unnecessary code, since the problem was really just membership and prefixes.

In the specific case, hashtable wouldn't have worked, since it's not sorted and it doesn't solve the original problem.

For set membership checks, bloom filters are way more memory efficient. And almost accurate. Lots of bloom filters of different sizes holding the same data can bring down the false positives to negligible.
Is having bloom filters of sizes 1024, 2048, and 4096 more dependable than one bloom filter of 8192?

That is surprising to me.

It's a similar strategy to having one bloom filter with many hash functions, but the math is a little different. Nearly the same effect though.
Thanks, while I think the pronunciation is lame and confusing, I grow tired of explaining to people that it is actually correct, they aren't called "trys" or "treys".
Pronunciation can be a very individual or regional thing when it comes to technical abbreviations. I just finished watching the Channel9 GoingNative talks (C++), and was surprised to find a couple of the most respected C++ guys pronounce 'ptr' as 'put-er' (not 'putter') instead of 'pointer'.

Personally I'll continue to pronounce it as 'try'. It has, at least, fewer conflicting interpretations in a programming context. When I say 'tree', people will probably assume I mean a binary tree, not a radix tree. So, if I'm not going at 30% light speed, I'll just say that.

Mispronouncing variables is great fun. We have lots of "txn_XXX" which we pronounce "texan_XXX", and "le_XXX" (short for "leafentry XXX", as in "le_key"), which always make us sound like we're mocking French people.
Now you're getting it. I also pronounce 'char' from C as if it's short for charred, rather than character. Principally because I find it easier easier to say 'char star' with that vocalisation. I know one guy who, if speaking quickly, would pronounce 'char* sugar' as 'caster sugar'... come to think of it, I'm not sure how Americans pronounce 'caster', but in the UK it's car-stir sugar. Not 'Casper', as in the friendly ghost, but with a 't'.

I've never had a problem with people using different pronunciations. It certainly keeps talking about code fresh.

cache is a fun one (cash or kay-sh?). Most Australians I know (and possibly people from the UK too) say it as "cash"
Under what rules of pronunciation would it be Kay-sh?

I recognize pronunciation is more about exceptions than rules, but even so...

Heh. When we have a variable name with a question mark to denote a boolean, as in "ready?", we pronounce the question mark "what" like an old-school English gentleman—so, "readywhat", or rather: "Ready, what?" It still cracks me up sometimes.
I was taught by Evi Nemeth (requiescat in pace) that trie pronounced like "try" and to use it that way.
The reason is because the name comes from the word "reTRIEval," which was a pretty dumb move in the first place.
The great thing about pronunciation is that whatever most people say, wins. I'm going to keep using the pronunciation that rhymes with "dyes" so we can avoid confusion with the more general structure.
They are called "trys" because that's how most people pronounce it.
I just call them all radix trees.
You're only looking at one kind of trie, there are also Judy arrays, which have much better performance characteristics and (if we're abusing the notation) the same worst-case memory complexity.
Judy arrays are a straightforward, highly optimized implementation of a HAMT.

Unfortunately, they were highly optimized for 32-bit machines.