Hacker News new | ask | show | jobs
by petergeoghegan 1450 days ago
> Modern hardware has evolved to the point where B+trees will often offer disappointing results, so their use in indexing has dwindled with time.

This is pure nonsense. B+Trees are used extensively and by default by 5 out of 5 of the top database systems, according to db-engines.com.

1 comments

You don't actually address the point.

If your database engine is an old design or your data is small by modern standards, then a B+tree will be one of the few indexing algorithms available and if the data is small it will probably work. Modern database kernels targeting modern hardware and storage densities typically aren't using B+trees and the reasons why are well-understood. No one with any sense is using a B+tree to index e.g. a trillion records, which is a pretty ordinary thing to do on a single server in 2022.

You can't just swap out indexing architectures due to their dependency on storage engine and scheduling behavior, so older databases like PostgreSQL will be using B+trees for the indefinite future even if suboptimal.

The transition away from B+tree based architectures in new databases engines started about 10-15 years ago. Back then I used them ubiquitously but I honestly don't remember the last time I've seen one in a new design.

> so older databases like PostgreSQL will be using B+trees for the indefinite future even if suboptimal.

PostgreSQL 14 comes with 6 builtin index types[1]: B-tree, Gist, SP-Gist, Gin, Brin, and Hash. More can be plugged in as extensions.

[1]: Chapters 63-69 of https://www.postgresql.org/docs/14/internals.html

Edited: Fixed the link to version 14.

> You don't actually address the point.

You said that B-Trees "use in indexing has dwindled with time". This is demonstrably false.

> Back then I used them ubiquitously but I honestly don't remember the last time I've seen one in a new design.

Even if that was true (which it definitely isn't), why would anybody judge the commercial or scientific relevance of B-Trees by looking at what new systems do? There are very few new systems that are intended to be competitive as general purpose systems, which is where most of the market is.

You still haven't actually named a single example of a "modern database kernel" that exemplifies what you're talking about.

You are understating the limitations of B+trees for real workloads. A common and growing problem is the lack of online indexing that scales, the particularly data model doesn't matter that much. Index construction throughput and scaling has been a serious problem at some pretty boring companies I've done work for.

Use of B+trees in new database kernels has definitely diminished. I'm not counting the installed base of SQLite etc. Ubiquity doesn't make something the pinnacle of technology -- just as often it means "legacy installed base". I still use PostgreSQL a lot and mod it when I need to but I am not under any illusions about its limitations.

A "modern" database kernel that can efficiently use modern hardware is going to be a thread-per-core architecture with all I/O and execution scheduling done in user space, and the ability to operate on modern storage densities found on database servers, which can exceed a petabyte of direct-attached storage. The implications of storage density and its interaction with indexing drive most of the real changes in the way database kernels are designed. You can find elements of this in open source, but mostly in big data platforms rather than proper database engines.

That said, no one builds new high-end databases for retail anymore, the economics don't make sense. All the money moved to more specialized implementations that cater to smaller audiences where you don't need to advertise. The kernels are fully general, and widely reused, but the interfaces and surrounding bits are purpose-built for particular workloads. Hell, my old storage engines are still used under license by that lot. The days of database billboards on the 101 are an anachronism.

You keep talking about how B-Trees are rarely used but I’ve seen relatively new systems deployed that use them (or some minor variation). FoundationDB, FASTER, and a few others.

Other than in-memory hash indexing as used by SAP HANA, I’m not aware of any other data structures anywhere near as popular for database engines.

Can you name the data structure(s) that have superseded these?

> You are understating the limitations of B+trees for real workloads.

I never said anything about workloads. All I said was that your statements about B+Trees having dwindling usage are clearly false.

If you make a claim that is self-evidently bogus, then you shouldn't expect anything else that you may have said at the same time to be taken seriously.

Can you provide an explicitly named "modern" database that doesn't use B+tree indexes, and what specifically does it use instead?
" Like many modern analytical engines [18, 20], Procella does not use the conventional BTree style secondary indexes, opting instead for light weight secondary structures such as zone maps, bitmaps, bloom filters, partition and sort keys [1]. The metadata server serves this information during query planning time. These secondary structures are collected partly from the file headers during file registration, by the registration server, and partly lazily at query evaluation time by the data server. Schemas, table to file mapping, stats, zone maps and other metadata are mostly stored in the metadata store (in Bigtable [11] and Spanner [12])."

https://storage.googleapis.com/pub-tools-public-publication-...

> modern analytical engines

What about relational databases? Most are best suited for OLTP workloads.

Read this guy's past posts, you will save yourself a lot of time. He does a lot of this sort of thing.
...says the person who is confidently oblivious to the problems of mixing virtual methods and direct paged memory.
Thanks for the tip