Hacker News new | ask | show | jobs
by cogman10 268 days ago
> Has anyone put machine learning in an SQL query optimizer yet?

Yes, I think everyone has? At very least I know that MSSQL has because we semi regularly run into problems with it :).

MSSQL keeps track of query statistics and uses those in future planning. SOMETIMES it just so happens that the optimization for the general case makes the outlier 100x slower which kills general performance.

2 comments

For a while I was maintaining software that supported both MSSQL and PGSQL, and I found that, when comparing like-for-like without DB-specific tuning, MSSQL produced better query plans on average. On a database without write contention, I'd often see 30% better performance on MSSQL.

However, it was also much more likely to hit an AWFUL pathological case which completely wrecked the performance as you describe. Combined with pessimistic locking, we ended up with far more overnight support calls from the MSSQL backend than from the PGSQL backend. Usually because it suddenly decided to switch query plan at 1AM.

I wonder if there's a trade-off where an optimizer produces better average query plans but worse outliers.

> I wonder if there's a trade-off where an optimizer produces better average query plans but worse outliers.

There is. A (now) classic case is running a nested loop with a sequential scan on the inner side. Works great if the outer side produces few rows as expected, a disaster if there's a misestimation and you have a thousand rows there.

The MSSQL and Postgres planners are different enough that you can't really say it's about one thing, though (MSSQL is one of the few implementations of the Cascades framework in a sort-of hybrid AFAIK, Postgres is straight-up System R).

I think the big difference is that PostgreSQL doesn't cache query plans at all by default. Only if you use prepared statements. My understanding is that MSSQL does heavily cache them.

It sounds plausible to me that caching would often lead to significant performance improvements overall, but trigger bad plans much more often since the plans are not re-evaluated based on the statistics of each single query. So in Postgres you'd get individual queries with pathological performance when the statistics are off, in MSSQL all executions of that query have bad performance until the plan is re-evaluated again.

We're experiencing the same with MSSQL, and for our most important queries have started adding a constant-valued dummy column to the SELECT section which value changes every few minutes. Essentially an integer equal to UNIX time divided by 600 or similar.

That way a cached bad plan can't cause issues for more than a few minutes, which is acceptable for our use-case.

It's a sledgehammer but it was easy to add and it works.

In Oracle many years ago, we did the same thing by prepending a SQL comment to the query string. You’d think that the plan cacher normalizes queries first, but I guess not.

That might work in your case as well, without requiring modifications in logic to support the dummy field?

At 100x, it seems like you could run both optimal strategies every time, let them race, and still come out way ahead.
You double + the IO and potentially CPU time which is why this isn't done. It's also not always 100x, that just happens often enough. Sometimes it's only 2x or 1.5x. It's impossible to know which situation you are in and the hard thing is the outliers will be slow either way.