|
I’m a huge fan of Postgres. This one is “user error”, but we still got bit pretty hard. A query plan changed, on a frequently-run query (~1k/sec) on a large table (~2B rows) without warning. Went from sub-millisecond to multi-second. The PG query planner is generally very good, but also very opaque. The statistics collected during an ANALYZE and used by the planner are subject to some significant caveats. Essentially, the planner would sometimes wildly mis-estimate costs due to under-sampling, and would choose a bad plan. We fixed it in two different ways: 1) lower the auto-ANALYZE threshold; 2) increase the number of rows sampled when collecting statistics for the relevant column. Again, this was “user error”. That said, it will probably happen again on the same or another query, because it’s hard to know if/when a query plan is about to change, and pg_hint_plan and similar are very heavy-handed solutions. |
I'm not even too bothered by the opaqueness of the query planner (although I'd love better visibility into it). But the fact that the query plan can change any second is insane: you can't lock it, and you can't force another one as a short-term fix.
There's no option that I know of. If you reach an impossible-to-anticipate threshold and the query plan changes, your whole system can be down and you can only fix forward, which might take a _long_ time to figure out and is super dangerous as you'll pretty much have to experiment on your prod database.
It's insane, I've not yet been bit too bad by it but I know it's coming for me.