Hacker News new | ask | show | jobs
by wjakob 2355 days ago
What's wrong with red-black trees? :) Since you mention C++: you do realize that std::map will typically be implemented with one?
4 comments

You want more fanout to use the entire cache line or even the entire virtual memory page. RB trees assume access to any address in memory has equal cost, which is extremely incorrect.
The red-black tree implementation typically used in *BSD development, <sys/tree.h>, is an intrusive data structure; tree nodes are embedded within the object you must dereference for key comparison. There's no need to optimize fanout because you've entirely removed that extra indirection.

As a general purpose data structure implementation for systems development, <sys/tree.h> is quite nice. However, for any particular task you can always optimize the data structure, or choose an entirely different data structure, to better fit the problem. Which is perhaps why the Linux kernel has so many different tree and hash implementations.

std::map and std::set are famously terrible, for that reason: https://youtu.be/fHNmRkzxHWs?t=2695
Indeed. C++ needs some better Standard containers. As it is, you need 3rd party library containers when you need optimal performance on ordered collections. Abseil and boost have alternatives.

The great advantage C++ has is that there is no temptation to open-code them. Library implementations can absorb immense optimization and testing effort, amortized over all uses, and delivered without compromise.

Horrible cache utilization.
Compared to what?
If you're not interleaving insertions and lookups, then a sorted vector can be really good.

If you need to interleave insertions and lookups, but don't need to traverse in sorted order, then a good hash table (not std::unordered_map) is normally the best option.

You only need a tree if you need traversal in sorted order and interleaved insertions and lookups, which is pretty uncommon. Even then you are almost always better off with a B-tree than a red-black tree.

Even with interleaved insertions and lookups, there are common scenarios that make a sorted vector still significantly more performant. I wrote about it when we published our open source SortedList<T> implementation for .NET and .NET Core [0], specifically comparing it to AVL and Red-Black trees.

[0]: https://neosmart.net/blog/2019/sorted-list-vs-binary-search-...

That's a good point, and a really interesting blog post.
Seems like these criteria miss the use case for mm_rb, one of the central red-black trees in linux.

If you need to be able to store intervals (i.e. virtual memory areas) and do lookups based on any address in the interval, not just the base address, I don't think a hash map is the best option.

IMO "lookups based on any address in the interval" requires "traversal in sorted order", although I could have probably been more precise in my terms.

I would be curious if anyone has ever profiled the impact of changing mm_rb to a B-tree. It might be very difficult if existing code that uses mm_rb depends on pointer stability, though.

I believe Splay trees are usually mentioned as an alternative.
Aren't splay trees terrible because they turn all reads into writes? (Which means the cache lines bounce between readers, instead of being shared as they would be with pure reads.)
Yeah, having reads rebalancing the tree in a multithreaded subsystem is probably not optimal. Might be that the Splay-trees have outstayed their welcome :)

FreeBSD’s tree.h used to have, iirc, both RB- and Splay-trees.

A B-tree set, for example.
std::set too, which really is just a map where the key and value are the same.