
by hyperman1 1375 days ago

In general I love what RDBMS and postgresql in particular can bring to you, but this is one corner of them that I hate: Planners are too smart for their own good.

This is a standard story: A query ran for a long time without issue, and then, one day, some minor shift in your statistics happens, and now you have a major performance issue on your hands, without any real change on prod. No non-production environment can help you: They don't have the same data or they do have the same data but the statistic sampling was slightly different.

If it happens, you have to find an incantation that makes the planner comprehend what's going wrong. PostgreSQL has things like statistics objects and, in this case, the per-column statistics setting, but finding the right incantation can be black magic, and adding indexes/statistics/... can take hours, so trial and error is slow.

Dumber databases have an edge here: Their performance is probably lower, but it is predictable.

Some RDBMS have mitigations, e.g. Oracle's optimizer plan stability allows you to make the plan unchangeable. It's a double-edged sword of course: The plan won't get better when the data would allow it, but it won't get worse either.

8 comments

lawrjone 1375 days ago

Author of the article here, and I thought this was worth commenting on:

> No non-production environment can help you: They don't have the same data or they do have the same data but the statistic sampling was slightly different.

GoCardless still has a massive Postgres database (10TB or thereabouts) and only managed to scale it by investing heavily in tooling that helps developers work with it safely.

One example is https://github.com/gocardless/draupnir, a tool to create instances of production datasets very quickly (just `eval $(draupnir new); psql` and you have a mini production in ~3s) so you could try things like adding indexes, tweaking the plan settings (`set enable_seqscan = 'off'`) and reshaping the data to see how your planner behaved.

I think it's very doable, though the planner still has blind spots. I had a side project to add linear correlation statistics to the planner that I abandoned when I stopped working with big Postgres databases, but that's an example of a statistic that Postgres just doesn't track and whose absence leads to these pathological edge cases.

I'd rather have the clever planner than not, though. I've a healthy appreciation for the heavy lifting Postgres can do for you.

hyperman1 1375 days ago

Very true. Postgres is a good product, using it in prod without much maintenance is very doable, and the planner is a good thing that regularly outsmarts me with a strategy that is better than what I had in mind. I am sitting on a 1TB db with it right now, expected to grow to ~10TB in the next few years. I like what postgres brings to the table for me.

But it's not all roses and rainbows. Draupnir seems cool, but it can't help you avoid the problem, only fix it faster.

At the core, there is a trade-off here: Performance for predictability. You see the same thing in compilers, JITs and sometimes even processor cores. There is an optimizer in these things that works 99% of the time, and makes things clearly better than human effort alone allows. But once in a while it guesses wrong, and you fall off a performance cliff.

Meanwhile, predictability is valuable in production, even to the point you might want to trade a serious dent in your performance for it.

retcore 1375 days ago

Wow. GoCardless has over five years of engineering invested in scaling Postgres, and it took ongoing development to enable them to "work with it safely"? At what point did operating safety concerns arise? I'm absolutely no fan of Oracle business practices, but this sounds like a story that sales can dine on for years, notwithstanding my private conviction that I could negotiate Exadata for less than the gross salaries excluding employer contributions.

lawrjone 1375 days ago

Of course GoCardless have to invest engineering effort into scaling the technology they use.

The company doubles in size every year, with totally different products and use patterns.

If you're under the impression that purchasing an Oracle database would mean GC can let go of their engineers then I have a bridge to sell you!

hyperman1 1375 days ago

As an escapee from Oracle land: I am not at all advocating purchasing it. Postgres in general is great, even if Oracle has some features that are better. The enterprise-class insanity of the Oracle world is literally a cause of developer depression in some cases. Run away.

retcore 1374 days ago

Hi,

How do you figure that I claimed an Oracle license would make your team redundant?

I certainly didn't suggest anything like that.

But please forgive me for being reflexive, has your response accidentally been more revealing than intended about duplicating technologies that isn't AS clear as it might be from the context?

Nobody gets to let go any good database team this side of sanity for whatever reason.

Nevertheless I didn't phrase my comment as thoroughly as I probably should have:

"Working safely with" any asset class data store just shouldn't ever be a question without immediate answers. Ongoing development for the same pirates doesn't provide executive management with solid answers to "define safety issues present, future and potentially retrospectively debugging any failure".

A 10TB primary dataset isn't a considerable amount of production data on the value-chain scale of data stores. By default it is not a large database on any Oracle installation.

DBMSs require administrative rigor and procedures, ideally an in-depth theory of operations, and codebase development skills with the engine and management system itself are extremely desirable.

However, I can't help thinking that here is a potential case of taking those undoubted talents to directly create proprietary variants of Postgres, which is only going to develop technical debts and future increase in nominally normative support costs.

In other words, I think that your talents have been inappropriately unleashed. You're brain surgeons and everything looks like a brain to you?

Unfortunately and obviously the mere mention of Oracle is liable to create greater difficulty in the creation of openly equitable technical discussions. Oracle management managed to perfect this awful disassociation effect, I'm convinced, purely for stress testing potential customers, often in a gaslighting-style / hazing sales process. Check out the lady Oracle sales executive who files suit in California every year just to try and negotiate payments of some approximate order to her contractual commission deal. Not many smaller companies get much beyond that not inconsiderable corporate culture clash. I used to joke that Oracle sales cycles were a super proving ground for whether you have a growth business model or not. Because account growth is the key to sales commissions and simultaneously your easiest leverage for getting big discounts. (Start from 50% before anyone says anything; it kept our business afloat.)

Ultimately I am unable to understand the point you're making, because you claim inseparable effects from tangential issues and not the smallest misrepresentation of my argument.

Being pedantic, neither of us ought to have used the word "purchase" in relation to licensing nigh inseparable from support and other fat margins. You can still buy a per-socket license for RDB, however, and obtain the x64 40% socket discount if you can get VMS X64 running. Sometime soon I'm going to risk indicating that we've renewed interest in such an installation.

The problem I have with the engineering path you took is merely that for the scale and growth outlined, off-the-shelf solutions absolutely exist and some are thoroughly honed for optimal low administration and even lights-out running. We're all potentially caught up in the limitations of early startup scaling of essential computer services, when ad hoc OSS wrangling definitely can sound more attractive than months failing to even understand the small talk spoken by the whole freaking teams turning up to sell you big company shrink-wrap software. I'm going round this once again, and have cut myself the budget for dumping all non-novel problems onto the most tried solutions. In the end, being ruthless to only spend engineering resources on strategic advantage absolutely can encompass a large license deal or two, but ironically whilst wanting to get rid of Oracle (together with historical deployment and tuning sins) is just that much more attractive a fictional moral campaign than holy war against teaching development teams about transactions on payroll. I'm cynical indeed, but I hope the circumstances are more clear now?

rockwotj 1375 days ago

We've had a similar issue here due to using SERIALIZABLE transactions, and postgres choosing an index that caused it to lock the whole relation due to how locks are upgraded if you scan for too much [1].

Every change to our prod DB requires running EXPLAIN and EXPLAIN ANALYZE on some data to make sure the queries are doing the right thing (we use GCP Query Insights to watch for regressions [2]).

The vast majority of our queries are single index scans. I wish there was a database where we could fix the plan when our app is deployed. For the most part our schema is fairly denormalized so we don't need very complex queries. The flexibility/power of SQL is really for debugging, analytics and other one-off queries.

Hot take: I wish there was a DB that didn't force SQL. At least for the application, instead you just told it what scan you wanted to do (basically embed the plan directly in the query you send). There could be a reporting mechanism if the DB detected a more efficient plan for the query or something. You could still have a SQL layer for your debug and one-off sessions.

I would vastly prefer the predictability over occasional performance spikes or in our case a spike of transaction failures due to a predicate lock being grabbed for a whole table.

[1]: the default here is 32 rows (https://www.postgresql.org/docs/current/runtime-config-locks...)

[2]: https://cloud.google.com/sql/docs/postgres/using-query-insig...
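
The lock upgrade described in this comment can be pictured with a toy model. This is only a sketch under loud assumptions: the threshold of 32 is taken from footnote [1], the class and method names are hypothetical, and real PostgreSQL also has page-level locks in between, which this ignores.

```python
# Toy model of predicate-lock promotion, NOT PostgreSQL's implementation.
# PROMOTION_THRESHOLD mirrors the "32" from the footnote; all names are made up.
PROMOTION_THRESHOLD = 32


class RelationLocks:
    """Tracks the fine-grained read locks one transaction holds on one table."""

    def __init__(self):
        self.row_locks = set()
        self.relation_locked = False

    def lock_row(self, row_id):
        if self.relation_locked:
            return  # already covered by the relation-level lock
        self.row_locks.add(row_id)
        if len(self.row_locks) > PROMOTION_THRESHOLD:
            # Too many fine-grained locks: swap them for one coarse lock.
            self.row_locks.clear()
            self.relation_locked = True

    def conflicts_with_write(self, row_id):
        # After promotion, a write to ANY row of the table conflicts.
        return self.relation_locked or row_id in self.row_locks


narrow = RelationLocks()
for row in range(10):       # a plan that reads only a few rows
    narrow.lock_row(row)

wide = RelationLocks()
for row in range(40):       # a plan that scans past the threshold
    wide.lock_row(row)

print(narrow.conflicts_with_write(999), wide.conflicts_with_write(999))
```

In this model the same query becomes far more conflict-prone as soon as the chosen plan reads past the threshold, which is why a plan change can show up as a spike in serialization failures rather than as slowness.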

jonatron 1375 days ago

Postgres is the odd one out. MySQL [1] has plenty of index hints, including FORCE. In the proprietary world, MS SQL Server and Oracle also have query hints. I don't know if there's anything other than a wiki page [2] that hasn't been updated since 2015 that justifies it.

[1]: https://dev.mysql.com/doc/refman/8.0/en/index-hints.html

[2]: https://wiki.postgresql.org/wiki/OptimizerHintsDiscussion

rockwotj 1375 days ago

We did try pg_hint_plan and it didn't seem to work for us, unfortunately.

https://pghintplan.osdn.jp/pg_hint_plan.html

petergeoghegan 1375 days ago

> Planners are too smart for their own good.

I know what you mean, but I don't think that that quite captures it. It's more like this: planners are built on a set of assumptions that are often pretty far from robust, but nevertheless work adequately well in almost all cases. Including many cases where the assumptions haven't been met!

The best example is the standard assumption that multiple conditions/columns are independent of each other -- all optimizers make this assumption (some can be coaxed into recognizing specific exceptions). This is obviously not true much of the time, even with a well normalized schema. Because: why would it be?
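
A minimal Python sketch of what the independence assumption does to a row estimate. The data is made up (the `city_id`/`zip_id` columns are hypothetical), and the arithmetic is a simplification, not PostgreSQL's actual estimator.

```python
# Toy data, not from the thread: 10,000 rows where zip_id is fully determined
# by city_id, i.e. the two columns are perfectly correlated.
rows = [(i % 100, i % 100) for i in range(10_000)]  # (city_id, zip_id)
total = len(rows)

# Per-column selectivities, roughly what single-column statistics can tell you.
sel_city = sum(1 for city, _ in rows if city == 7) / total
sel_zip = sum(1 for _, zip_id in rows if zip_id == 7) / total

# Independence assumption: multiply the selectivities of the two conditions.
estimated_rows = total * sel_city * sel_zip

# Ground truth: count the rows that match both conditions.
actual_rows = sum(1 for city, zip_id in rows if city == 7 and zip_id == 7)

print(f"estimated={estimated_rows:.1f} actual={actual_rows}")
```

With perfectly correlated columns the estimate comes out at about 1 row while 100 rows actually match, and an error of that size is the kind that can tip a planner toward a plan suited to a much smaller result.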

All kinds of correlations naturally appear in real data. It's just that it mostly doesn't cause huge problems most of the time, for messy reasons that can't quite be pinned down. You have to get unlucky; the correlations usually have to be very high, and the planner makes completely the wrong inference for the actual query that you ran (not some hypothetical other query). The planner only has to have approximately the right idea to discover the cheapest plan. And the planner doesn't have to discover the cheapest plan in many cases -- there may be quite a few adequate plans (it's really hard to generalize, but that's often true).

Overall, the fact that cost-based optimizers work as well as they do seems quite surprising to me.

darksaints 1375 days ago

> Some RDBMS have mitigations, e.g. Oracle's optimizer plan stability allows you to make the plan unchangeable. It's a double-edged sword of course: The plan won't get better when the data would allow it, but it won't get worse either.

That's simply not true, it's just less noticeable. Because even if your query plan is not changing, your data is. There will always be some point where your data grows and a reasonable planner (whether you or your database) has to adapt to that as it grows. For example, if a small lookup table grows enough, it stops being faster to do a full table scan on that table, and it becomes reasonable to do an index lookup. If your plan never changes, your performance gets worse. You may argue that fixed query plans are more predictable, but they are not objectively better.
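
The lookup-table example can be sketched with a toy cost model. The constants and function names below are invented for illustration and are not PostgreSQL's cost parameters; the point is only the shape of the two curves.

```python
import math

# Toy cost model with made-up constants, NOT PostgreSQL's planner costs.
SEQ_COST_PER_ROW = 1.0      # cost of reading one row during a full table scan
INDEX_COST_PER_LEVEL = 4.0  # cost of one random page read while descending an index


def seq_scan_cost(n_rows: int) -> float:
    # A full scan touches every row, so cost grows linearly with table size.
    return SEQ_COST_PER_ROW * n_rows


def index_lookup_cost(n_rows: int) -> float:
    # Descend a tree whose height grows with log2(n), then fetch the row.
    return INDEX_COST_PER_LEVEL * (math.log2(max(n_rows, 2)) + 1)


def cheapest_plan(n_rows: int) -> str:
    if seq_scan_cost(n_rows) <= index_lookup_cost(n_rows):
        return "seq_scan"
    return "index_lookup"


for n in (4, 100, 1_000_000):
    pinned = seq_scan_cost(n)  # the plan that was pinned while the table was tiny
    print(n, cheapest_plan(n), f"pinned={pinned:.0f}", f"index={index_lookup_cost(n):.1f}")
```

Under these assumptions a pinned full scan is the right choice at 4 rows and increasingly wrong afterwards, but its cost rises smoothly with the row count rather than jumping overnight.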

hyperman1 1375 days ago

To clarify: [the plan] won't get worse. It will of course be less adapted to the new reality. This generally means you'll get a gradual performance deterioration, but not an unexpected cliff. Gradual deterioration is preferable on prod, as it gives you time to react without causing a major incident. Of course, if you ignore the warnings, you're just as dead.

cube2222 1375 days ago

Afaik Aurora lets you pin plans and will at the same time ask you to approve new, better plans, if ones are discovered.

I haven't used it yet, but it sounds like that gives you the cake and lets you eat it too.

gurjeet 1375 days ago

> A query ran for a long time without issue, and then, one day, some minor shift in your statistics happens, and now you have a major performance issue ..

This is the precise problem I’m working on solving. See the pg_plan_guarantee extension.

https://github.com/DrPostgres/pg_plan_guarantee

hyperman1 1375 days ago

I like it. This is comparable to the Oracle plan stability feature.

I don't like the interface, however. As you have to wrap the query with custom markers $pgpg$, you can't use it on anything that programmatically generates the query, like an ORM.

I'd prefer an interface where you have a table that maps the query (hash?) to a plan. Then create a stored procedure e.g. nail_plan('SELECT blah blah blah') that inserts a record in that table. You can then backup and restore plans, easily query what plans are guaranteed, maybe even migrate plans between dev and prod. Table could also mark which plans are now invalid.
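
A rough Python sketch of the table-based interface proposed here, with an in-memory dict standing in for the table. Everything in it is hypothetical: `nail_plan` is the name floated in the comment, while `lookup_plan`, `invalidate_plan`, the hashing scheme and the plan strings are illustrative assumptions, not part of pg_plan_guarantee or PostgreSQL.

```python
import hashlib

# Stand-in for the proposed mapping table: query hash -> pinned plan record.
plan_table = {}


def normalize(query: str) -> str:
    # Collapse whitespace so trivially different texts share one key.
    # Assumption: a real implementation would parse the query instead, since
    # this would also alter whitespace inside string literals.
    return " ".join(query.split())


def query_hash(query: str) -> str:
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()


def nail_plan(query: str, plan: str) -> str:
    """Pin a plan for a query and return the key it was stored under."""
    key = query_hash(query)
    plan_table[key] = {"query": normalize(query), "plan": plan, "valid": True}
    return key


def lookup_plan(query: str):
    """Return the pinned plan, or None if nothing valid is pinned."""
    entry = plan_table.get(query_hash(query))
    return entry["plan"] if entry and entry["valid"] else None


def invalidate_plan(query: str) -> None:
    """Mark a pinned plan as no longer usable, e.g. after an index is dropped."""
    entry = plan_table.get(query_hash(query))
    if entry:
        entry["valid"] = False


nail_plan("SELECT *   FROM t WHERE id = 1", "Index Scan using t_pkey on t")
print(lookup_plan("SELECT * FROM t WHERE id = 1"))
```

Because the pin is keyed on the query text rather than on markers inside it, an ORM-generated query would match without modification, and the table contents could be dumped, restored, or copied between environments like any other data.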

paulryanrogers 1375 days ago

There is also pg_hint_plan. MySQL can suffer from this lurking-threshold problem too, though it has hints built in.

Upgrading major versions and huge surges of writes are often the catalyst for crossing performance cliffs, IME. The first can be planned for, and load testing can help get ahead of the second.

stonemetal12 1375 days ago

Could be cool if the RDBMS A/B tested its plans: if the new plan isn't better, don't switch to it. That would certainly add to the black magic of it, though, so maybe add a command to show the dev the top 5 plans and let them pick and pin one.

darksaints 1375 days ago

I don't think that's really necessary, and it could make the predictability problem much worse. What is really needed is a better cost model. Query planners fail only when their statistics and cost estimates do not reflect reality. What we need are statistical distributions for costs, with those distributions updating after every query. And any time you prepare a statement, it should be looking for patterns to create better column/row statistics and indexing schemes.

hyperman1 1375 days ago

That won't work, unfortunately. It just pushes the issue to the next postgres restart.