Hacker News new | ask | show | jobs
by gfody 903 days ago
> It was invented over 40 years ago, yet it is still employed by the majority of modern databases.

I wonder how true this is for the top commercial engines (Oracle, MS, IBM, etc.) whose internals are closed source and proprietary. Even a decade ago my experience performance testing Exadata implied some exotic magic at work, ie lookups are way faster than the expected O(log n). More recently while testing SQL Server's ability to join hundreds of tables together the performance was _way_ in excess of what I expected. I can't imagine these things have internals all that similar to say the B+Tree inside MySQL for example.

2 comments

A lot of this comes down to query planners being really good at finding clever ways of doing scans and intersections of indexes, the tables themselves having indexes with a bunch of specialized representations, and the query execution doing very intelligent data traversal with partitioning or even multi-threading.

If you sit down and think carefully about your data you can often make even a simple bare-bones B-tree perform fantastically for a query, well in excess of what you'd get out of mysql or sqlite (which are already pretty fast).

I think the most important takeaway is that the old school RDBMS products are probably more than enough for whatever you are trying to accomplish. Query planners in these are indistinguishable from magic, as should anything that has been forged in the fires of a million production environments for a few decades.

I've been playing around with an idea that involves putting sql server at the heart of a game engine, and it is turning into one of the biggest rabbit holes I've ever explored. I thought latency/jitter would be more of a problem but it simply isn't.

On disk, SQL Server uses only b-trees, unless using the new ColumnStore format.

In memory during a query it can use temporary indexes of other types, primarily hash tables and bitmaps.

Its performance on ad-hoc complex queries is about as good as it gets, few if any other RDBMS can beat its performance, but under the hood it’s still mostly just doing joins on b-trees!

> ..under the hood it’s still mostly just doing joins on b-trees!

I could see the on-disk format needing to be simple and stable, but once the datas buffered who knows what structures and algorithms these proprietary engines are using? You would need to have done some reverse engineering or had hands-on details from the inside which presumably comes w/legal consequences for leaking them.

It tells you what it uses when you inspect the “query plans”. There’s some fairly technical explanation in the docs of what each of the operators do.

Generally the secret sauce in these things is the query optimiser heuristics.

The actual data structures and algorithm are often relatively simple.

Having said that, I’ve read their whitepaper on how they implement hash tables, and… it’s way more complex than I had assumed.

They cater for scenarios like many duplicated keys, parallel construction, unbalanced load across CPU cores, etc…

the query plan says "key lookup" or "index seek" can we really assume anything about the implementation?