Hacker News new | ask | show | jobs
by andydb 2779 days ago
Good question, and something I didn't have the space to cover in this article. The answer involves "physical properties". Physical properties are interesting characteristics of an expression that impact its layout, presentation, or location, but not its logical content. Examples include row order, column naming, and data distribution (physical location of data ranges in the cluster).

As an example, consider a memo group that contains both a MergeJoin and a HashJoin operator. The MergeJoin "requires" its inputs to be sorted on the same key. If an input is not already sorted, then we insert a "sort enforcer" operator that will provide the required ordering. Of course, that increases the cost of producing the input, which the optimizer must take into account.

So, getting back to your question, for any given memo group, we always select the equivalent plan with the lowest cost, for a given set of physical properties. However, that has the effect of sometimes selecting a memo group operator that has a higher cost, because it "pays off" higher up the operator tree. Back to my example, say the SQL query is something like this:

  SELECT * FROM a, b WHERE a.name=b.name ORDER BY a.name
The lowest cost plan might use a MergeJoin, because its output is already naturally sorted on "name". However, the input to the MergeJoin might be higher-cost than logical equivalents, because it must be sorted. Therefore, we have the situation you alluded to, where we choose a higher-cost subtree because it minimizes the cost of the larger tree that contains it.
1 comments

Great article!. While reading it I see some similarities with "Plan Stitching" from MSR as shown here https://www.microsoft.com/en-us/research/uploads/prod/2018/0.... It is interesting to see what the performance differences between the two. Also, I am curious of why build a query optimizer from scratch when you can adopt or port something like Orca?
Interesting MSR paper. With regards to Orca, we did take a look at it. It's an OLAP optimizer written in C++ (and very lightly commented and documented, unfortunately), and we wanted a Go-based optimizer that was more integrated with our codebase, as well as targeted at OLTP scenarios (as example Orca supports parallel search, which makes sense in OLAP, but not so much in OLTP). There is quite a bit of overlap, though, and we read their excellent paper and consulted the code from time to time (I hereby invoke a curse on Hungarian notation, which makes reading their code quite painful). We also suspect that Greenplum did not open source some key bits of the optimizer code, such as their search prioritization algorithms.
Yes, that code was painful indeed (was trying to port it at one time). On another note are there any plans on adding advisory tools for database tuning (such as AutoAdmin from MSR) or adding ML to self-tune depending on workload?
Database tuning tools are not on the near-term roadmap, but I think it's a good idea that we'll want to look at eventually. Using ML to self-tune databases is a hot research subject right now, so we'll be following developments on that front.