| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by leif 4703 days ago
	Tries (actually pronounced "trees" or "Patricia trees") are very memory efficient compared to most alternatives, but really only shine in membership query workloads. If you actually need to return the stored element, you have to rebuild it for every query, which is probably too expensive if you have lots of queries, it would be better to store the whole object continuously somewhere in the data structure so you could return a reference to it. Even then, tries only win if the distribution is suitable to give you the memory efficiency you are hoping for. If you don't have many common prefixes, you're better off with just a hashtable.

5 comments

nly 4703 days ago

If your key type is lexically sorted, you can also do prefix searches in a binary search tree.

link

barrkel 4703 days ago

Prefix searches and range queries in a large corpus with a lot of shared prefixes - a file system listing cache - have been my primary reason for using tries.

You can answer questions like "how many files does this directory have, immediately in it" and also "under it recursively" relatively quickly, and if you use a Patricia trie with a little cleverness in the compression of the text, it needn't take up much space at all even for millions of paths. Whereas a binary tree isn't so great at the memory saving bit.

link

acbellini 4703 days ago

Yes, I used that. It's what the ceiling() method in the binary tree in the article does, it looks for the prefix. Still, it's slower :-)

link

acbellini 4703 days ago

I have reported the "try" pronunciation because that's the way I have learnt them, and also because that's how Sedgewick's book explicitly uses, but I guess there are alternatives :-)

Anyway, you just got right the point that I was trying to make, a trie is definitely not a general purpouse data structure, but for the specific case (limited number of elements, many common prefixes etc) it performs very well. As for the point of returning the original object, it suffices to store the original object in end-word nodes. I didn't do that in this code because it would have added unnecessary code, since the problem was really just membership and prefixes.

In the specific case, hashtable wouldn't have worked, since it's not sorted and it doesn't solve the original problem.

link

chatman 4703 days ago

For set membership checks, bloom filters are way more memory efficient. And almost accurate. Lots of bloom filters of different sizes holding the same data can bring down the false positives to negligible.

link

GhotiFish 4703 days ago

Is having bloom filters of sizes 1024, 2048, and 4096 more dependable than one bloom filter of 8192?

That is surprising to me.

link

leif 4703 days ago

It's a similar strategy to having one bloom filter with many hash functions, but the math is a little different. Nearly the same effect though.

link

zheng 4703 days ago

Thanks, while I think the pronunciation is lame and confusing, I grow tired of explaining to people that it is actually correct, they aren't called "trys" or "treys".

link

nly 4703 days ago

Pronunciation can be a very individual or regional thing when it comes to technical abbreviations. I just finished watching the Channel9 GoingNative talks (C++), and was surprised to find a couple of the most respected C++ guys pronounce 'ptr' as 'put-er' (not 'putter') instead of 'pointer'.

Personally I'll continue to pronounce it as 'try'. It has, at least, fewer conflicting interpretations in a programming context. When I say 'tree', people will probably assume I mean a binary tree, not a radix tree. So, if I'm not going at 30% light speed, I'll just say that.

link

leif 4703 days ago

Mispronouncing variables is great fun. We have lots of "txn_XXX" which we pronounce "texan_XXX", and "le_XXX" (short for "leafentry XXX", as in "le_key"), which always make us sound like we're mocking French people.

link

nly 4703 days ago

Now you're getting it. I also pronounce 'char' from C as if it's short for charred, rather than character. Principally because I find it easier easier to say 'char star' with that vocalisation. I know one guy who, if speaking quickly, would pronounce 'char* sugar' as 'caster sugar'... come to think of it, I'm not sure how Americans pronounce 'caster', but in the UK it's car-stir sugar. Not 'Casper', as in the friendly ghost, but with a 't'.

I've never had a problem with people using different pronunciations. It certainly keeps talking about code fresh.

link

bjeanes 4703 days ago

cache is a fun one (cash or kay-sh?). Most Australians I know (and possibly people from the UK too) say it as "cash"

link

DigitalJack 4703 days ago

Under what rules of pronunciation would it be Kay-sh?

I recognize pronunciation is more about exceptions than rules, but even so...

link

gruseom 4703 days ago

Heh. When we have a variable name with a question mark to denote a boolean, as in "ready?", we pronounce the question mark "what" like an old-school English gentleman—so, "readywhat", or rather: "Ready, what?" It still cracks me up sometimes.

link

jloughry 4703 days ago

I was taught by Evi Nemeth (requiescat in pace) that trie pronounced like "try" and to use it that way.

link

leif 4703 days ago

The reason is because the name comes from the word "reTRIEval," which was a pretty dumb move in the first place.

link

scott_s 4703 days ago

The great thing about pronunciation is that whatever most people say, wins. I'm going to keep using the pronunciation that rhymes with "dyes" so we can avoid confusion with the more general structure.

link

Kiro 4703 days ago

They are called "trys" because that's how most people pronounce it.

link

tptacek 4702 days ago

I just call them all radix trees.

link

AaronFriel 4703 days ago

You're only looking at one kind of trie, there are also Judy arrays, which have much better performance characteristics and (if we're abusing the notation) the same worst-case memory complexity.

link

jemfinch 4703 days ago

Judy arrays are a straightforward, highly optimized implementation of a HAMT.

Unfortunately, they were highly optimized for 32-bit machines.

link