Hacker News new | ask | show | jobs
by davidrowley 856 days ago
That could be useful if there was a way to just disable non-parameterized nested loop, however enable_nestloop=0 also disables parameterized nested loops. Parameterized nested loops are useful to avoid sorting or hashing some large relation when only a small subset of that relation is likely to have a join partner. This is even more true when you consider that since PG14, Memoize exists to act as a cache between Nested Loop and its inner subnode to cache previously looked-up values.

It's also important to consider that with enable_nestloop=0, when Nested Loop must be used (e.g for a CROSS JOIN) that the cost penalty that's added to reduce the chances of Nested Loop being used can dilute the costs so much that the query planner can then go on to make poor subsequent choices later in planning due to the costs for each method of implementing the subsequent operation being so relatively close to each other than the slightly cheaper one might not even be considered. See add_path() and STD_FUZZ_FACTOR. So, running enable_nestloop=0 in production is not without risk.

1 comments

Yeah, nestloop with a cheap inner path (e.g. a lookup into a unique index) should be just fine, so I don't think nestloops as a whole should be banned. (Also, I believe Postgres is pretty much the only place I've seen the concept of a parameterized path described; it's not talked much about in academia, although it is probably really hard to make an index-aware System R planner without it.)

I wondered whether it would be possible just to add a fixed fuzz to every row estimate, say five rows. It would essentially mean you can never get this issue of a small undercount causing a plan disaster. Overestimating slightly is basically never a big issue as far as I know.

(I should perhaps have considered this when I was actually making a query planner in a previous life, but there were more than enough other things to worry about :-) )

It's a bit complex to explain here, but I describe an idea I've been considering in https://www.postgresql.org/message-id/CAApHDvo2sMPF9m=i+YPPU...
Yes, a lot of this seems interesting to go into. I really hope that at some point, someone would find the resources to just try a lot of dumb stuff and see what works in practice. I mean, what we have right now (multiply selectivities together as if they were independent) is also pretty dumb, and there's no good reason why it should be preferred over everything else.

Another avenue is of course trying to avoid the issue to begin with, e.g. through the recent “translation grids” of Müller and Moerkotte for better join selectivities. But I doubt anyone is going to be finding a silver bullet for this anytime soon, so reducing plan risk somehow seems very worthwhile.

> I mean, what we have right now (multiply selectivities together as if they were independent) is also pretty dumb

Yeah, I think it was probably a mistake to always assume there's zero correlation between columns, but what value is better to use as a default? At least extended statistics allows the correlations of multiple columns to be gathered now. That probably means we'd be less likely to reconsider changing the default assumption of zero correlation when multiplying selectivities.

Yeah, I don't have all the answers, my point is that it would be worthwhile to try similarly dumb stuff and see if it works just as well and could be more robust in some places :-)

Multi-selectivities are good, but IIRC they can't be specified across tables and thus across joins, right? Out of curiosity; how do you reconcile multiple selectivities? If you have a multi-column histogram on (a,b) and one on (b,c) and one on (c), and you need to figure out the selectivity of WHERE a=? AND b=? AND c=? from those three? I looked into this at some point, and academia presented me with a nightmare of second-order cone programming and stuff. :-) (I never implemented any of it before leaving the database world.)

> Multi-selectivities are good, but IIRC they can't be specified across tables and thus across joins, right?

Yeah, no extended statistics for join quals yet.

> how do you reconcile multiple selectivities?

Looking at https://doxygen.postgresql.org/extended__stats_8c.html#a3f10... it seems the aim is to find the stats that cover the largest number of clauses tiebreaking on the statistics with the least number of keys. For your example both of those are the same, so it seems that which stats are applied is down to the order the stats appear in the stats list. That list is ordered by OID, which does not seem ideal as a dump and restore could result in the stats getting a different OID. Seems sorting that list by statistics name might be better. That's what we do for triggers, which seems like a good idea as it gives the user some ability to control the trigger fire order.