Hacker News new | ask | show | jobs
by setr 2779 days ago
>it sounds like it took a pretty big leap of faith to go from heuristics to cost-based optimization

Im pretty sure cost-based optimization is common enough and practically proven effective enough that its not really fair to call it a “leap of faith”; eg im pretty sure mysql and postgres both do it (and you can it see it in their EXPLAINs), and I imagine most commercial dbs do as well

Its more of just a natural progression as the db gets improved

Its a good article/work nonetheless, but nothing out of the ordinary in db implementation afaik

2 comments

Postgres and MySQL both do calculate costs. One difference is that they are not based around the Memo data structure (someone correct me if I'm wrong about that!), which gives an extra measure of flexibility, power, and elegance to solving the problem of SQL optimization. The Memo structure was introduced by Goetz Graefe in his seminal papers on top-down, rule-based query optimization (Exodus, Volcano, Cascades), which are the chief inspiration for CockroachDB's new optimizer. In the late 90's, Goetz joined Microsoft, and assisted the SQL Server team in rewriting their optimizer from scratch based on the techniques he had pioneered. I believe this foundation is a big reason why the SQL Server optimizer has become one of the best (if not the best) in the world. We are indeed building on the shoulders of giants.
>I believe this foundation is a big reason why the SQL Server optimizer has become one of the best (if not the best) in the world

Im not at all versed in the subject, but it seems like the memo datastructure isn’t significant to the correctness of the outcome: it makes the search space smaller/faster, but doesn’t seem to lead to better cost assignment, and the logical equivalence rules should be enforced regardless of the memo dt (the alternative being external to the memoization structure)

Is it just that faster search => more paths searched before timing out/giving up? And perhaps easier to reason about/encode?

To be clear, the Exodus/Volcano/Cascades foundation includes more than just the Memo. Also important is the idea that you can declaratively define rewrite rules and generate parts of your optimizer. Also, optimization proceeds top-down rather than bottom-up, which allows you to prune the search space as you explore, as well as open up important lines of communication from nodes higher in the tree to nodes lower in the tree (i.e. required physical properties, which I explain in another HN comment).

You're correct on your points (search space can be smaller/faster, with less memory usage). In addition, when you're dealing with a piece of software as complex as a SQL optimizer, choosing modular, elegant abstractions can make a big difference in terms of understandability, maintainability and extensibility of the codebase. That in turn leads to fewer bugs, such as subtle violations of logical equivalence when transforming the tree. I think the more structured, general approach to optimization allows mere mortals to manage higher levels of complexity, in which the RDBMS defines 100's of transformations, offers many join ops (hash, merge, nested, zigzag, etc.) and data types (geospatial, json, etc), as well as incorporating data locality in the new world of geo-distribution.

As one example to illustrate why there's a practical correspondence between well-designed abstractions/architecture and correctness (i.e. fewer bugs), consider these PRs that we recently merged:

  https://github.com/cockroachdb/cockroach/pull/31977
  https://github.com/cockroachdb/cockroach/pull/30636
These PRs randomly disable transformation rules and perturb the cost of expressions in the optimizer, in order to test alternate (but equivalent) plans. It took just 100 lines of code to add this support, because it simply plugs into the general optimization framework. And in return, our existing test cases can now test multiple different plans, allowing us to shine light into the dark corners where bugs like to hide (cockroaches, of course!).

As another example, we have a testing mode called "optsteps", which will single-step the optimizer and print out the tree after each transformation to make it easy to track down bugs. Here's an example of that output, again made possible by the extensible, modular design of the optimizer:

  https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/opt/norm/testdata/rules/combo
Yes, Postgres, MySQL, MSSQL, Oracle, and i'm sure DB2 all use a cost based optimizer.

+1 on the sentiment, good to see them continuing to improve, they have lots of competition to look at for examples of what to do / not to do. Progress is good none the less.