Hacker News new | ask | show | jobs
by edmundhuber 2779 days ago
Great article, and it sounds like it took a pretty big leap of faith to go from heuristics to cost-based optimization. If this was addressed in the article I apologize: do you only ever select transformations that yield equivalent results, with immediately lower cost, or do you also explore transformations that incur in immediate cost increase, but then later pay off? i.e. are you doing simple hill climbing, or something more interesting?
2 comments

Good question, and something I didn't have the space to cover in this article. The answer involves "physical properties". Physical properties are interesting characteristics of an expression that impact its layout, presentation, or location, but not its logical content. Examples include row order, column naming, and data distribution (physical location of data ranges in the cluster).

As an example, consider a memo group that contains both a MergeJoin and a HashJoin operator. The MergeJoin "requires" its inputs to be sorted on the same key. If an input is not already sorted, then we insert a "sort enforcer" operator that will provide the required ordering. Of course, that increases the cost of producing the input, which the optimizer must take into account.

So, getting back to your question, for any given memo group, we always select the equivalent plan with the lowest cost, for a given set of physical properties. However, that has the effect of sometimes selecting a memo group operator that has a higher cost, because it "pays off" higher up the operator tree. Back to my example, say the SQL query is something like this:

  SELECT * FROM a, b WHERE a.name=b.name ORDER BY a.name
The lowest cost plan might use a MergeJoin, because its output is already naturally sorted on "name". However, the input to the MergeJoin might be higher-cost than logical equivalents, because it must be sorted. Therefore, we have the situation you alluded to, where we choose a higher-cost subtree because it minimizes the cost of the larger tree that contains it.
Great article!. While reading it I see some similarities with "Plan Stitching" from MSR as shown here https://www.microsoft.com/en-us/research/uploads/prod/2018/0.... It is interesting to see what the performance differences between the two. Also, I am curious of why build a query optimizer from scratch when you can adopt or port something like Orca?
Interesting MSR paper. With regards to Orca, we did take a look at it. It's an OLAP optimizer written in C++ (and very lightly commented and documented, unfortunately), and we wanted a Go-based optimizer that was more integrated with our codebase, as well as targeted at OLTP scenarios (as example Orca supports parallel search, which makes sense in OLAP, but not so much in OLTP). There is quite a bit of overlap, though, and we read their excellent paper and consulted the code from time to time (I hereby invoke a curse on Hungarian notation, which makes reading their code quite painful). We also suspect that Greenplum did not open source some key bits of the optimizer code, such as their search prioritization algorithms.
Yes, that code was painful indeed (was trying to port it at one time). On another note are there any plans on adding advisory tools for database tuning (such as AutoAdmin from MSR) or adding ML to self-tune depending on workload?
Database tuning tools are not on the near-term roadmap, but I think it's a good idea that we'll want to look at eventually. Using ML to self-tune databases is a hot research subject right now, so we'll be following developments on that front.
>it sounds like it took a pretty big leap of faith to go from heuristics to cost-based optimization

Im pretty sure cost-based optimization is common enough and practically proven effective enough that its not really fair to call it a “leap of faith”; eg im pretty sure mysql and postgres both do it (and you can it see it in their EXPLAINs), and I imagine most commercial dbs do as well

Its more of just a natural progression as the db gets improved

Its a good article/work nonetheless, but nothing out of the ordinary in db implementation afaik

Postgres and MySQL both do calculate costs. One difference is that they are not based around the Memo data structure (someone correct me if I'm wrong about that!), which gives an extra measure of flexibility, power, and elegance to solving the problem of SQL optimization. The Memo structure was introduced by Goetz Graefe in his seminal papers on top-down, rule-based query optimization (Exodus, Volcano, Cascades), which are the chief inspiration for CockroachDB's new optimizer. In the late 90's, Goetz joined Microsoft, and assisted the SQL Server team in rewriting their optimizer from scratch based on the techniques he had pioneered. I believe this foundation is a big reason why the SQL Server optimizer has become one of the best (if not the best) in the world. We are indeed building on the shoulders of giants.
>I believe this foundation is a big reason why the SQL Server optimizer has become one of the best (if not the best) in the world

Im not at all versed in the subject, but it seems like the memo datastructure isn’t significant to the correctness of the outcome: it makes the search space smaller/faster, but doesn’t seem to lead to better cost assignment, and the logical equivalence rules should be enforced regardless of the memo dt (the alternative being external to the memoization structure)

Is it just that faster search => more paths searched before timing out/giving up? And perhaps easier to reason about/encode?

To be clear, the Exodus/Volcano/Cascades foundation includes more than just the Memo. Also important is the idea that you can declaratively define rewrite rules and generate parts of your optimizer. Also, optimization proceeds top-down rather than bottom-up, which allows you to prune the search space as you explore, as well as open up important lines of communication from nodes higher in the tree to nodes lower in the tree (i.e. required physical properties, which I explain in another HN comment).

You're correct on your points (search space can be smaller/faster, with less memory usage). In addition, when you're dealing with a piece of software as complex as a SQL optimizer, choosing modular, elegant abstractions can make a big difference in terms of understandability, maintainability and extensibility of the codebase. That in turn leads to fewer bugs, such as subtle violations of logical equivalence when transforming the tree. I think the more structured, general approach to optimization allows mere mortals to manage higher levels of complexity, in which the RDBMS defines 100's of transformations, offers many join ops (hash, merge, nested, zigzag, etc.) and data types (geospatial, json, etc), as well as incorporating data locality in the new world of geo-distribution.

As one example to illustrate why there's a practical correspondence between well-designed abstractions/architecture and correctness (i.e. fewer bugs), consider these PRs that we recently merged:

  https://github.com/cockroachdb/cockroach/pull/31977
  https://github.com/cockroachdb/cockroach/pull/30636
These PRs randomly disable transformation rules and perturb the cost of expressions in the optimizer, in order to test alternate (but equivalent) plans. It took just 100 lines of code to add this support, because it simply plugs into the general optimization framework. And in return, our existing test cases can now test multiple different plans, allowing us to shine light into the dark corners where bugs like to hide (cockroaches, of course!).

As another example, we have a testing mode called "optsteps", which will single-step the optimizer and print out the tree after each transformation to make it easy to track down bugs. Here's an example of that output, again made possible by the extensible, modular design of the optimizer:

  https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/opt/norm/testdata/rules/combo
Yes, Postgres, MySQL, MSSQL, Oracle, and i'm sure DB2 all use a cost based optimizer.

+1 on the sentiment, good to see them continuing to improve, they have lots of competition to look at for examples of what to do / not to do. Progress is good none the less.