Hacker News new | ask | show | jobs
by Sesse__ 857 days ago
Yeah, nestloop with a cheap inner path (e.g. a lookup into a unique index) should be just fine, so I don't think nestloops as a whole should be banned. (Also, I believe Postgres is pretty much the only place I've seen the concept of a parameterized path described; it's not talked much about in academia, although it is probably really hard to make an index-aware System R planner without it.)

I wondered whether it would be possible just to add a fixed fuzz to every row estimate, say five rows. It would essentially mean you can never get this issue of a small undercount causing a plan disaster. Overestimating slightly is basically never a big issue as far as I know.

(I should perhaps have considered this when I was actually making a query planner in a previous life, but there were more than enough other things to worry about :-) )

1 comments

It's a bit complex to explain here, but I describe an idea I've been considering in https://www.postgresql.org/message-id/CAApHDvo2sMPF9m=i+YPPU...
Yes, a lot of this seems interesting to go into. I really hope that at some point, someone would find the resources to just try a lot of dumb stuff and see what works in practice. I mean, what we have right now (multiply selectivities together as if they were independent) is also pretty dumb, and there's no good reason why it should be preferred over everything else.

Another avenue is of course trying to avoid the issue to begin with, e.g. through the recent “translation grids” of Müller and Moerkotte for better join selectivities. But I doubt anyone is going to be finding a silver bullet for this anytime soon, so reducing plan risk somehow seems very worthwhile.

> I mean, what we have right now (multiply selectivities together as if they were independent) is also pretty dumb

Yeah, I think it was probably a mistake to always assume there's zero correlation between columns, but what value is better to use as a default? At least extended statistics allows the correlations of multiple columns to be gathered now. That probably means we'd be less likely to reconsider changing the default assumption of zero correlation when multiplying selectivities.

Yeah, I don't have all the answers, my point is that it would be worthwhile to try similarly dumb stuff and see if it works just as well and could be more robust in some places :-)

Multi-selectivities are good, but IIRC they can't be specified across tables and thus across joins, right? Out of curiosity; how do you reconcile multiple selectivities? If you have a multi-column histogram on (a,b) and one on (b,c) and one on (c), and you need to figure out the selectivity of WHERE a=? AND b=? AND c=? from those three? I looked into this at some point, and academia presented me with a nightmare of second-order cone programming and stuff. :-) (I never implemented any of it before leaving the database world.)

> Multi-selectivities are good, but IIRC they can't be specified across tables and thus across joins, right?

Yeah, no extended statistics for join quals yet.

> how do you reconcile multiple selectivities?

Looking at https://doxygen.postgresql.org/extended__stats_8c.html#a3f10... it seems the aim is to find the stats that cover the largest number of clauses tiebreaking on the statistics with the least number of keys. For your example both of those are the same, so it seems that which stats are applied is down to the order the stats appear in the stats list. That list is ordered by OID, which does not seem ideal as a dump and restore could result in the stats getting a different OID. Seems sorting that list by statistics name might be better. That's what we do for triggers, which seems like a good idea as it gives the user some ability to control the trigger fire order.

OK, so basically ad-hoc/greedy. That can get you into, well, issues :-) There are methods where you first adjust away impossible statistics (if you have quals a,b, then P(a)P(b) >= P(a AND b), but if your estimates come from multiple places this might not hold[1] -- so you need to fudge one or more of those before proceeding), and then combine them optimally using Newton-Raphson. It's pretty neat. But it does require a matrix library and stuff :-)

[1] Similarly, you can sometimes guarantee stuff like P(a AND b) <= 1/n due to a unique multi-column index, which can also inform some of your sub-estimates.