Hacker News new | ask | show | jobs
by Sesse__ 856 days ago
Yeah, I don't have all the answers, my point is that it would be worthwhile to try similarly dumb stuff and see if it works just as well and could be more robust in some places :-)

Multi-selectivities are good, but IIRC they can't be specified across tables and thus across joins, right? Out of curiosity; how do you reconcile multiple selectivities? If you have a multi-column histogram on (a,b) and one on (b,c) and one on (c), and you need to figure out the selectivity of WHERE a=? AND b=? AND c=? from those three? I looked into this at some point, and academia presented me with a nightmare of second-order cone programming and stuff. :-) (I never implemented any of it before leaving the database world.)

1 comments

> Multi-selectivities are good, but IIRC they can't be specified across tables and thus across joins, right?

Yeah, no extended statistics for join quals yet.

> how do you reconcile multiple selectivities?

Looking at https://doxygen.postgresql.org/extended__stats_8c.html#a3f10... it seems the aim is to find the stats that cover the largest number of clauses tiebreaking on the statistics with the least number of keys. For your example both of those are the same, so it seems that which stats are applied is down to the order the stats appear in the stats list. That list is ordered by OID, which does not seem ideal as a dump and restore could result in the stats getting a different OID. Seems sorting that list by statistics name might be better. That's what we do for triggers, which seems like a good idea as it gives the user some ability to control the trigger fire order.

OK, so basically ad-hoc/greedy. That can get you into, well, issues :-) There are methods where you first adjust away impossible statistics (if you have quals a,b, then P(a)P(b) >= P(a AND b), but if your estimates come from multiple places this might not hold[1] -- so you need to fudge one or more of those before proceeding), and then combine them optimally using Newton-Raphson. It's pretty neat. But it does require a matrix library and stuff :-)

[1] Similarly, you can sometimes guarantee stuff like P(a AND b) <= 1/n due to a unique multi-column index, which can also inform some of your sub-estimates.