Hacker News new | ask | show | jobs
by jandrewrogers 1456 days ago
> "Scale of data often works against you, and balanced trees are the first tool in your arsenal against it."

An ironic caveat to this is that balanced trees don't scale well, only offering good performance across a relatively narrow range of data size. This is a side-effect of being "balanced", which necessarily limits both compactness and concurrency.

That said, concurrent B+trees are an absolute classic and provide important historical context for the tradeoffs inherent in indexing. Modern hardware has evolved to the point where B+trees will often offer disappointing results, so their use in indexing has dwindled with time.

3 comments

> Modern hardware has evolved to the point where B+trees will often offer disappointing results, so their use in indexing has dwindled with time.

This is pure nonsense. B+Trees are used extensively and by default by 5 out of 5 of the top database systems, according to db-engines.com.

You don't actually address the point.

If your database engine is an old design or your data is small by modern standards, then a B+tree will be one of the few indexing algorithms available and if the data is small it will probably work. Modern database kernels targeting modern hardware and storage densities typically aren't using B+trees and the reasons why are well-understood. No one with any sense is using a B+tree to index e.g. a trillion records, which is a pretty ordinary thing to do on a single server in 2022.

You can't just swap out indexing architectures due to their dependency on storage engine and scheduling behavior, so older databases like PostgreSQL will be using B+trees for the indefinite future even if suboptimal.

The transition away from B+tree based architectures in new databases engines started about 10-15 years ago. Back then I used them ubiquitously but I honestly don't remember the last time I've seen one in a new design.

> so older databases like PostgreSQL will be using B+trees for the indefinite future even if suboptimal.

PostgreSQL 14 comes with 6 builtin index types[1]: B-tree, Gist, SP-Gist, Gin, Brin, and Hash. More can be plugged in as extensions.

[1]: Chapters 63-69 of https://www.postgresql.org/docs/14/internals.html

Edited: Fixed the link to version 14.

> You don't actually address the point.

You said that B-Trees "use in indexing has dwindled with time". This is demonstrably false.

> Back then I used them ubiquitously but I honestly don't remember the last time I've seen one in a new design.

Even if that was true (which it definitely isn't), why would anybody judge the commercial or scientific relevance of B-Trees by looking at what new systems do? There are very few new systems that are intended to be competitive as general purpose systems, which is where most of the market is.

You still haven't actually named a single example of a "modern database kernel" that exemplifies what you're talking about.

You are understating the limitations of B+trees for real workloads. A common and growing problem is the lack of online indexing that scales, the particularly data model doesn't matter that much. Index construction throughput and scaling has been a serious problem at some pretty boring companies I've done work for.

Use of B+trees in new database kernels has definitely diminished. I'm not counting the installed base of SQLite etc. Ubiquity doesn't make something the pinnacle of technology -- just as often it means "legacy installed base". I still use PostgreSQL a lot and mod it when I need to but I am not under any illusions about its limitations.

A "modern" database kernel that can efficiently use modern hardware is going to be a thread-per-core architecture with all I/O and execution scheduling done in user space, and the ability to operate on modern storage densities found on database servers, which can exceed a petabyte of direct-attached storage. The implications of storage density and its interaction with indexing drive most of the real changes in the way database kernels are designed. You can find elements of this in open source, but mostly in big data platforms rather than proper database engines.

That said, no one builds new high-end databases for retail anymore, the economics don't make sense. All the money moved to more specialized implementations that cater to smaller audiences where you don't need to advertise. The kernels are fully general, and widely reused, but the interfaces and surrounding bits are purpose-built for particular workloads. Hell, my old storage engines are still used under license by that lot. The days of database billboards on the 101 are an anachronism.

You keep talking about how B-Trees are rarely used but I’ve seen relatively new systems deployed that use them (or some minor variation). FoundationDB, FASTER, and a few others.

Other than in-memory hash indexing as used by SAP HANA, I’m not aware of any other data structures anywhere near as popular for database engines.

Can you name the data structure(s) that have superseded these?

> You are understating the limitations of B+trees for real workloads.

I never said anything about workloads. All I said was that your statements about B+Trees having dwindling usage are clearly false.

If you make a claim that is self-evidently bogus, then you shouldn't expect anything else that you may have said at the same time to be taken seriously.

Can you provide an explicitly named "modern" database that doesn't use B+tree indexes, and what specifically does it use instead?
" Like many modern analytical engines [18, 20], Procella does not use the conventional BTree style secondary indexes, opting instead for light weight secondary structures such as zone maps, bitmaps, bloom filters, partition and sort keys [1]. The metadata server serves this information during query planning time. These secondary structures are collected partly from the file headers during file registration, by the registration server, and partly lazily at query evaluation time by the data server. Schemas, table to file mapping, stats, zone maps and other metadata are mostly stored in the metadata store (in Bigtable [11] and Spanner [12])."

https://storage.googleapis.com/pub-tools-public-publication-...

Read this guy's past posts, you will save yourself a lot of time. He does a lot of this sort of thing.
...says the person who is confidently oblivious to the problems of mixing virtual methods and direct paged memory.
Thanks for the tip
You never post anything constructive. Just that everything in open source sucks while you do fancy algorithms in your basement with spatial data at the boring facility. Anytime anybody asks for more you never reply with something constructive or link to a paper/algorithm or something.

The maximum you've told is "yeah do what scylladb does but you will still suck".

It just feels as advertisement and doesn't really add anything to the discussion I believe. All your comments are the same in all database threads.

endrant

What kinds of indexing structures are used instead, and how do they differ from B+trees? Do you have examples of which relational databases have replaced B+tree indexes?
The property being optimized for, relative to B+trees, is extreme compactness of representation. In the pantheon of possible indexing algorithms, B+trees are pretty far on the bloated end of the spectrum in terms of the ratio between data space and index space. All indexes have a scaling performance cliff due to the index structure filling up and eventually overflowing available cache, crowding out the data and forcing page faults for almost every index write. In B+tree indexes this happens relatively early and often.

Radically improving index compactness is achieved by loosening design constraints on B+trees: the indexes represent a partial order which only converges on a total order at the limit and the search structure is unbalanced. In the abstract these appear slightly less efficient but it enables the use of selectivity-maximizing succinct representations of the key space that can get pretty close to the information theoretic limits. Scalability gains result from the radical reduction in cache footprint when represented this way.

Optimal compressive indexes are not computable (being equivalent to AI), so the efficient approximation strategies people come up with tend to be diverse, colorful, and sometimes impractical. Tangentially, some flavors have excellent write performance. It is not a trivial algorithm problem but there are a few design families that generalize well to real databases engines. I wouldn't describe this as a fully solved problem but many ordinary cases are covered.

There isn't much incentive to design a relational database engine that can use these types of indexes, since the types of workloads and data models that recommend them usually aren't relational. Someone could, there just isn't much incentive. It is more de rigueur for graph, spatiotemporal, and some types of analytical databases, where there is no other practical option if scalability matters at all.

I know Clickhouse uses MergeTrees which are different from B+trees. However it can't really be used as an RDBMS. It's especially bad at point reads.

https://en.wikipedia.org/wiki/Log-structured_merge-tree

There are projects that use LSMTs as the storage engine for RDBMS' (like RocksDB); I'm not sure it's accurate to say "they can't be use as an RDMBS".
It's definitely possible, and can make a lot of sense -- MyRocks/RocksDB for MySQL seems like an interesting and well designed system to me. It is fairly natural to compare MyRocks to InnoDB, since they're both MySQL storage engines. That kind of comparison is usually far more useful than an abstract comparison that ignores the practicalities of concurrency control and recovery.

The fact that MyRocks doesn't use B+Trees seems like half the story. Less than half, even. The really important difference between MyRocks and InnoDB is that MyRocks uses log-structured storage (one LSM tree for everything), while InnoDB uses a traditional write-ahead log with checkpoints, and with logical UNDO. There are multiple dimensions to optimize here, not just a single dimension. Focusing only on time/speed is much too reductive. In fact, Facebook themselves have said that they didn't set out to improve performance as such by adopting MyRocks. The actual goal was price/performance, particularly better write amplification and space amplification.

I should have clarified that I meant Clickhouse can't really be used as an RDBMS. I wasn't aware that RocksDB uses LSMTs.
Databases designed to query large amounts of data tend to rely on partition key to allow parallel workloads (i.e. map reduce). And this partition data is stored like hash index.

Big query would be one example.