Hacker News new | ask | show | jobs
by jstimpfle 3460 days ago
Is this true? How is this more true for finger trees than for red-black trees, etc?
2 comments

As a rule, pointer-based data structures tend to be slower than array-based data structures (though, as with any general rule, there are exceptions). This is a simple consequence of how caching and prefetching works in modern CPUs. And this performance difference can easily amount to several (!) orders of magnitude.

However, there are a few caveats:

1. A garbage collector with a bump allocator can reduce these overheads significantly. So can a custom implementation (e.g. emulating pointers as offsets within an array).

2. The overhead often does not matter for performance due to the 90/10 law or because the application is not performance-critical to begin with.

3. Pointer-based data structures may support a richer set of operations than array-based data structures (or have significantly more efficient implementations for some operations).

4. Pointer-based data structures can have better worst-case or per-operation costs where array-based data structures rely on amortized or average performance to pull ahead. This can matter in soft real-time situations or when dealing with user-supplied data.

5. Certain types of data structures do not have good array-based implementations to begin with and some algorithms demand pointer-based implementations. (Think of spatial queries, for example.)

6. Performance cost is not always dominated by operations on the container, but can also be dominated by cost of operations on the items within the container. For example, there are algorithms where the most expensive operation (by far) is comparison between elements (e.g. for a priority queue implementing an expensive heuristic); in this case, you want to minimize the number of comparisons more than the cache behavior of the data structure itself (note that this may still result in you picking the array-based implementation).

7. Pointer-based data structures allow for the sharing of substructures, potentially reducing the memory footprint of an application. Binary decision diagrams are a popular example.

8. Some array-based data structures require a much larger working set than their corresponding pointer-based implementations and take a performance nosedive when the size of the working set exceeds the size of the cache. Especially when you have several such data structures in use at once. (This is where relying on microbenchmarks for performance comparison can be dangerous.)

Red-black trees are pretty much terrible for performance as well. I spend most of my work days trying to eradicate std::map and std::set for performance reasons.

Academics are always concerned about bounds, but practitioners are always concerned about cache misses.

I generally agree with this and most of the time it's hard to justify using a red-black tree if you care about performance. You can use one if you have other concerns perhaps or performance isn't one of them.

I also concur cache misses are vital and sometimes don't get enough attention since people pass hand wave them away as micro-benchmarking which isn't true often for critical code paths. I'll add that again, this is why testing on your target hardware is important. You may think you know, but often you don't. I've seen too much stupid stuff where someone wrote code that performs great on their development desktop and awful on a server or a game console because of different CPU architecture and cache design or sizes.

Maybe someone can dig it up, but I've seen some of the sub-variants of red-black trees have dramatically better performance than I thought possible (or worse, ex: left-leaning), and the results can vary across runtimes. In addition to cache misses, there can be some other considerations for wanting to use a red-black tree like concurrent or parallel programming. In these cases, some of the cousin data structures like AVL trees can swing you for or against a red-black tree or other backing structure. I had this come up recently when selecting an interval tree that needed to be accessed between threads for instance.

Selecting a data structure is harder than most people make it out to be. There's just a lot of considerations at work, so you need to balance use-cases, performance (against those cases and raw), access patterns, allocations, and so on to come to a decision. Sometimes if you add up the total cost of doing something like replacing a red-black tree with other methods that are more cache friendly or have some other characteristic that is better, the overall performance cost can end up being worse in aggregate. So the answer is the same as always, "it depends" and the follow-up, "try it."

Parent is definitely right that academic and practical concerns of those of us in the trenches being different. That's why arguing with pure textbook evidence is stupid and you should challenge anyone who does it. Textbooks are a starting point for making and justifying a conclusion, not a shortcut.

When you don't need persistence, a search tree data structure with fatter nodes (e.g., B-trees) is almost guaranteed to perform better on real hardware. But when you do need persistence, red-black trees become competitive again.
I would expect B-trees to outperform binary trees even in a persistent data structure. The necessity of binary trees comes when you need stable iterators/references.
Theory academics are primarily concerned with bounds, but systems academics are very much concerned with cache misses. I've done work with colleagues where we were very concerned with both.