Hacker News new | ask | show | jobs
by Animats 268 days ago
The query optimizer knows how many items are in each index, but has no advance idea how many items will be in the result of a JOIN. An "a OR b" query on a table with millions of rows might have three hits on A, or millions of hits. The optimal query strategy for the two cases is very different.

Has anyone put machine learning in an SQL query optimizer yet?

5 comments

> Has anyone put machine learning in an SQL query optimizer yet?

Yes, I think everyone has? At very least I know that MSSQL has because we semi regularly run into problems with it :).

MSSQL keeps track of query statistics and uses those in future planning. SOMETIMES it just so happens that the optimization for the general case makes the outlier 100x slower which kills general performance.

For a while I was maintaining software that supported both MSSQL and PGSQL, and I found that, when comparing like-for-like without DB-specific tuning, MSSQL produced better query plans on average. On a database without write contention, I'd often see 30% better performance on MSSQL.

However, it was also much more likely to hit an AWFUL pathological case which completely wrecked the performance as you describe. Combined with pessimistic locking, we ended up with far more overnight support calls from the MSSQL backend than from the PGSQL backend. Usually because it suddenly decided to switch query plan at 1AM.

I wonder if there's a trade-off where an optimizer produces better average query plans but worse outliers.

> I wonder if there's a trade-off where an optimizer produces better average query plans but worse outliers.

There is. A (now) classic case is running a nested loop with a sequential scan on the inner side. Works great if the outer side produces few rows as expected, a disaster if there's a misestimation and you have a thousand rows there.

The MSSQL and Postgres planners are different enough that you can't really say it's about one thing, though (MSSQL is one of the few implementations of the Cascades framework in a sort-of hybrid AFAIK, Postgres is straight-up System R).

I think the big difference is that PostgreSQL doesn't cache query plans at all by default. Only if you use prepared statements. My understanding is that MSSQL does heavily cache them.

It sounds plausible to me that caching would often lead to significant performance improvements overall, but trigger bad plans much more often since the plans are not re-evaluated based on the statistics of each single query. So in Postgres you'd get individual queries with pathological performance when the statistics are off, in MSSQL all executions of that query have bad performance until the plan is re-evaluated again.

We're experiencing the same with MSSQL, and for our most important queries have started adding a constant-valued dummy column to the SELECT section which value changes every few minutes. Essentially an integer equal to UNIX time divided by 600 or similar.

That way a cached bad plan can't cause issues for more than a few minutes, which is acceptable for our use-case.

It's a sledgehammer but it was easy to add and it works.

In Oracle many years ago, we did the same thing by prepending a SQL comment to the query string. You’d think that the plan cacher normalizes queries first, but I guess not.

That might work in your case as well, without requiring modifications in logic to support the dummy field?

At 100x, it seems like you could run both optimal strategies every time, let them race, and still come out way ahead.
You double + the IO and potentially CPU time which is why this isn't done. It's also not always 100x, that just happens often enough. Sometimes it's only 2x or 1.5x. It's impossible to know which situation you are in and the hard thing is the outliers will be slow either way.
I think having a way to build statistics on the join itself would be helpful for this. Similar to how extended statistics^1 can help when column distributions aren't independent of each other.

But this may require some basic materialized views, which postgres doesn't really have.

[1]: https://www.postgresql.org/docs/current/planner-stats.html#P...

could you elaborate on pg not really having matviews?
Materialized views in Postgres don't update incrementally as the data in the relevant tables updates.^1

In order to keep it up to date, the developer has to tell postgres to refresh the data and postgres will do all the work from scratch.

Incremental Materialized views are _hard_. This^2 article goes through how Materialize does it.

MSSQL does it really well from what I understand. They only have a few restrictions, though I've never used a MSSQL materialized view in production.^3

[1]: https://www.postgresql.org/docs/current/rules-materializedvi... [2]: https://www.scattered-thoughts.net/writing/materialize-decor... [3]: https://learn.microsoft.com/en-us/sql/t-sql/statements/creat...

The pg_ivm plugin adds incremental updates to Postgresql materialized views:

https://github.com/sraoss/pg_ivm

Though I don't know how well it works on a write-heavy production db.

As somebody who implemented manual incremental materialized tables using triggers, yeah it's pretty dang hard to make sure you get all the edge cases in which the data can mutate.
Some traditional planners try some different plans on random subsets of the data to see which plan works best. Don't need machine learning, just Bayes' rule.
> The query optimizer knows how many items are in each index, but has no advance idea how many items will be in the result of a JOIN.

Query optimizers definitely try to estimate cardinalities of joins. It's a really, really hard problem, but the typical estimate is _much_ better than “eh, no idea”.

There are many papers on ML for query planners. You can search for "Learned Query Optimization". Some use ML just for the cardinality estimation.