Hacker News new | ask | show | jobs
by leiroigh 2783 days ago
It is unsurprising that a structure designed for range queries loses out against hashes for random point look-ups.

Thanks for posting that addition and very nice link. It is completely unclear why authors in this field even attempt to compete with hash-tables for random point queries: Trees (whether comparison or radix) make only sense when using either range queries or when faced with a query distribution that prefers locality.

Something I'd like to see is benchmarks of batch-query performance: If we can queue a couple of queries, then trees should gain a lot from processing them in-order (then, the last query has already warmed the cache for the next; this probably can pay for approximately sorting the batch).

4 comments

Authors try to compare their work against hash tables because, usually, HT represent an upper bound on the performance of point-lookups; we don't know how to do much better in general.

So if your data structure supports range queries _and_ point lookups, you should measure against hash tables to understand how far off the ideal you are. If it's not far, and your data structure is strictly more general, that's compelling.

Good point!
> Something I'd like to see is benchmarks of batch-query performance: If we can queue a couple of queries, then trees should gain a lot from processing them in-order (then, the last query has already warmed the cache for the next; this probably can pay for approximately sorting the batch).

I'm not sure if this is your point, but when you actually look at the uses of trees like the above (often: query processing) then the point look-up times are less relevant for implementations that do not serialize the look-ups.

Another place this shows up is when folks switch from closed-loop testing to open-loop testing (i.e. testing under a fixed rate of look-ups, rather than one or some few at a time). The latency-throughput trade-off curve makes trees look a lot less bad, but they still have a large minimal latency. A hash map with a sorted key (e.g. RobinHood hashing) still works better except at peak loads.

You can take this just a bit further and you find that replacing trees with a sorted list also works great (or, LSM of sorted lists, say). "Great" here just means "has some tolerable minimum latency, and peak throughput".

> or when faced with a query distribution that prefers locality

Isn't this a fairly common pattern? How would we go about quantifying the preferring locality-ness of a query distribution?

>How would we go about quantifying the preferring locality-ness of a query distribution?

For each level of the tree, treat the incoming stream of queries as a markov process where each state is a query that involves a certain node. So, if I have 7 nodes on level 2, I can build up a table of transition probabilities between vertices like "query involved node 3 on level 2" and "query involved node 7 on level 2." When the transitions between these vertices and themselves have high probability, the queries prefer locality. You can see which scale the locality is preferred on by doing this at each level of the tree.

>Isn't this a fairly common pattern?

Depends on the application. It is certainly not uncommon that queries fired at close times share a prefix.

>How would we go about quantifying the preferring locality-ness of a query distribution?

I know it when I see it?

Some other good reasons to use trees/tries, besides the ones you mention:

- reasonable worst-case behaviour (think real-time apps, or adverserial scenarios)

- persistent data structures (eg HAMT), as used in Clojure, Scala & Immutable.js

And the speed boost from locality gets more dramatic every passing year as the memory gap keeps widening.

>reasonable worst-case behaviour (think real-time apps, or adverserial scenarios)

I'm not sure I agree. When faced with potentially adversarial input, a hash should be keyed. And, depending on the specific tree, you can probably cause a lot of havoc by causing repeated tree reorganizations. Even worse, you have a timing side-channel, because node-split and node-merge are expensive, likely measurable over the network. This can leak info about existing adjacent keys in the database.

This kind of attack is afaik not as researched or well-known as hash collision based attacks, but in my book that is a point in favor of hash tables with keyed hashes.

On the other hand, both approaches shouldn't play in the same league: If you need to support range queries, then you need to support range queries.

>And the speed boost from locality gets more dramatic every passing year as the memory gap keeps widening.

This is true for scenarios where trees save cache misses (e.g. if there is a correlation between order of queries and location of keys). The linked comparison explicitly compared L3 misses between various trees and various hash tables for various scenarios, and the hash tables come out better in that metric.

All that being said, ART is pretty cool and node16's branch-free SIMD search is amazing.