Hacker News new | ask | show | jobs
by jamii 2783 days ago
My experience has been that the vast majority of papers on data-structures are at best misleading, and at worst deliberately biased.

For example:

> The hash table used by the authors of ART in their study was a chained hash table, but this kind of hash tables can be suboptimal in terms of space and performance due to their potentially high use of pointers.

> Our experiments strongly indicate that neither ART nor Judy are competitive to the aforementioned hashing schemes in terms of performance, and, in the case of ART, sometimes not even in terms of space.

https://www.victoralvarez.net/papers/A%20Comparison%20of%20A...

3 comments

Funny enough, even that paper seems biased as well to me.

This paper probably applies best when you use the HT or tree as a DB index data-structure without constant updates for huuuge datasets, probably not as well when used as an updated associative map in your average program. I'd guess hash-tables (HT) are still faster in day-to-day programmer use but not as much as pointed out here.

Caveats:

- HTs pre-allocated: In the two experiments that show those detailed bar-charts, the hash-tables are completely pre-allocated, so no re-allocation and re-hashing on inserts.

- Charts: The workloads IV-C and IV-E get by far the most attention (3/4 page, 9 bar charts - each), these best support the paper's statements that HTs are much faster and consume less mem than ART. The mixed-use case IV-D is reported with only a single small chart, the whole write-up is just about 1/4 page combined and ART does better (comparable, but still slower than fastest HT) - arguably IV-D is probably most realistic use-case for how most people use hashmaps day-to-day.

- The keys are 64-bit ints so aren't all that long, again maybe common in a DB index but not really what you use hashmaps for mostly in the real world. So ART can't play out some of its strengths.

- The data set skews large, larger than the original paper (16M to 1,000,000,000 (1Billion) keys), and most people probably foremost look at the charts on the right with the huge numbers that make HT look best. Again, makes the results not super applicable to day-to-day hashmap use imo.

- They only look up existing keys, probably not great for either ART nor HT but could hurt their CH*Bucket performance.

Most of these caveats are explained in the test, but some are kind of buried. You can't just look at the charts and conclude hash-tables are 5x as fast as ART, always.

It is unsurprising that a structure designed for range queries loses out against hashes for random point look-ups.

Thanks for posting that addition and very nice link. It is completely unclear why authors in this field even attempt to compete with hash-tables for random point queries: Trees (whether comparison or radix) make only sense when using either range queries or when faced with a query distribution that prefers locality.

Something I'd like to see is benchmarks of batch-query performance: If we can queue a couple of queries, then trees should gain a lot from processing them in-order (then, the last query has already warmed the cache for the next; this probably can pay for approximately sorting the batch).

Authors try to compare their work against hash tables because, usually, HT represent an upper bound on the performance of point-lookups; we don't know how to do much better in general.

So if your data structure supports range queries _and_ point lookups, you should measure against hash tables to understand how far off the ideal you are. If it's not far, and your data structure is strictly more general, that's compelling.

Good point!
> Something I'd like to see is benchmarks of batch-query performance: If we can queue a couple of queries, then trees should gain a lot from processing them in-order (then, the last query has already warmed the cache for the next; this probably can pay for approximately sorting the batch).

I'm not sure if this is your point, but when you actually look at the uses of trees like the above (often: query processing) then the point look-up times are less relevant for implementations that do not serialize the look-ups.

Another place this shows up is when folks switch from closed-loop testing to open-loop testing (i.e. testing under a fixed rate of look-ups, rather than one or some few at a time). The latency-throughput trade-off curve makes trees look a lot less bad, but they still have a large minimal latency. A hash map with a sorted key (e.g. RobinHood hashing) still works better except at peak loads.

You can take this just a bit further and you find that replacing trees with a sorted list also works great (or, LSM of sorted lists, say). "Great" here just means "has some tolerable minimum latency, and peak throughput".

> or when faced with a query distribution that prefers locality

Isn't this a fairly common pattern? How would we go about quantifying the preferring locality-ness of a query distribution?

>How would we go about quantifying the preferring locality-ness of a query distribution?

For each level of the tree, treat the incoming stream of queries as a markov process where each state is a query that involves a certain node. So, if I have 7 nodes on level 2, I can build up a table of transition probabilities between vertices like "query involved node 3 on level 2" and "query involved node 7 on level 2." When the transitions between these vertices and themselves have high probability, the queries prefer locality. You can see which scale the locality is preferred on by doing this at each level of the tree.

>Isn't this a fairly common pattern?

Depends on the application. It is certainly not uncommon that queries fired at close times share a prefix.

>How would we go about quantifying the preferring locality-ness of a query distribution?

I know it when I see it?

Some other good reasons to use trees/tries, besides the ones you mention:

- reasonable worst-case behaviour (think real-time apps, or adverserial scenarios)

- persistent data structures (eg HAMT), as used in Clojure, Scala & Immutable.js

And the speed boost from locality gets more dramatic every passing year as the memory gap keeps widening.

>reasonable worst-case behaviour (think real-time apps, or adverserial scenarios)

I'm not sure I agree. When faced with potentially adversarial input, a hash should be keyed. And, depending on the specific tree, you can probably cause a lot of havoc by causing repeated tree reorganizations. Even worse, you have a timing side-channel, because node-split and node-merge are expensive, likely measurable over the network. This can leak info about existing adjacent keys in the database.

This kind of attack is afaik not as researched or well-known as hash collision based attacks, but in my book that is a point in favor of hash tables with keyed hashes.

On the other hand, both approaches shouldn't play in the same league: If you need to support range queries, then you need to support range queries.

>And the speed boost from locality gets more dramatic every passing year as the memory gap keeps widening.

This is true for scenarios where trees save cache misses (e.g. if there is a correlation between order of queries and location of keys). The linked comparison explicitly compared L3 misses between various trees and various hash tables for various scenarios, and the hash tables come out better in that metric.

All that being said, ART is pretty cool and node16's branch-free SIMD search is amazing.