| Author of the article here, and thought it's worth noting on: > No non-productive environment can help you: They don't have the same data or they do have the same data but the statistic sampling was slightly different. GoCardless still has a massive Postgres database (10TB or there-abouts) and only managed to scale it by investing heavily in tooling that helps developers work with it safely. One example is https://github.com/gocardless/draupnir, a tool to create instances of production datasets very quickly (just `eval $(draupnir new); psql` and you have a mini production in ~3s) so you could try things like adding indexes, tweaking the plan settings (`set enable_seq_scan='off'`) and reshaping the data to see how your planner behaved. I think it's very doable, though the planner still has blindspots. I had a side project to add linear correlation statistics to the planner that I abandoned when I stopped working with big Postgres databases, but that's an example of statistics that Postgres just doesn't track but lead to these pathological edge cases. I'd rather have the clever planner than not, though. I've a healthy appreciation for the heavy lifting Postgres can do for you. |
But it's not all roses and rainbows. Draupnir seem cool, but it can't help you avoid the problem, only fix it faster.
At the core, there is a trade off here: Performance for predictability. You see the same thing in compilers, JITs and sometimes even processor cores. There is an optimizer in these things, that works 99% of the time, and makes things clearly better than human effort alone allows. But once in a while it guesses wrong, and you fall off a performance cliff.
Meanwhile, predictability is valuable in production, even to the point you might want to trade a serious dent in your performance for it.