Hacker News new | ask | show | jobs
by jandrewrogers 1794 days ago
There is, but it isn’t what I would call intuitive. Traditional tree-based indexing algorithms embed a diverse set of assumptions about the nature of the hardware that aren’t actually true in modern systems. SSDs are also quasi-sequential devices in important ways that need to be respected for performance reasons. The design problem is that storage density and bandwidth has increased massively and traditional indexing algorithms are poorly suited for that environment.

The relative memory wastefulness of many indexing algorithms also means that you can’t load the indexes into memory on machines with high storage density, they literally won’t fit. This leads to a page fault cascade in practice that causes a very rapid degradation of performance beyond a relatively ordinary scale.

Modern indexing algorithms (not the ones in the article) give up a small amount of selectivity and balanced trees in exchange for reducing the memory footprint of indexes by multiple orders of magnitude and greatly increasing write scalability. Given the storage bandwidth available in modern systems, that is a big win in terms of throughput. Balanced trees are an anachronism generally in current design; it doesn’t matter how unbalanced an index tree is if it fits in CPU cache while searching countless terabytes of data.

20 years ago, I was using all of the indexing structures in the article for high-performance systems. They started going out of style about 10 years ago and I honestly can’t remember the last time I worked on a new system that implemented them. That transition happened slowly and then all at once.

1 comments

I recently started learning about the underlying data structures in databases. What are some examples of those "modern indexing algorithms"?
Most modern indexing structures are in the succinct radix tree algorithm family. This is a diverse algorithm space with many design knobs that can be used to optimize properties and behavior in multiple dimensions, so any two algorithm implementations can look quite different. Most algorithms of this type you see in the wild don't have names, they were designed to fit requirements from first principles.

You can find simple, albeit limited and not too succinct, examples of these types of data structures online. More advanced capabilities crammed into ever more succinct representations can quickly become difficult to reason about -- the code looks much more abstractly information theoretic and less like a traditional index.