Hacker News new | ask | show | jobs
by GordonS 1962 days ago
> No, I don't think statistics can let you get away with this. Databases are concurrent, you can't guarantee that a different session will not insert a record that invalidates your current statistics.

Of course you can't make a guarantee like that, but why would you need to? Statistics are there to guide planner choices, not make cast iron predictions.

2 comments

Let us say that in our example table we have 100 000 records with severit_id < 7, 200 000 with severity_id = 7 and 3 records with severity_id = 8.

Statistics claim 100k id < 7, 200k id = 7 and 0 with id > 7. The last 3 updates could have happened right before our query, the statistics didn't update yet.

Let us assume that we blindly trust the statistics and they currently state that there are absolutely no values with severity_id > 7 and you have a query WHERE severity_id != 7 and a partial index on severity_id < 7.

If you trust the statistics and actually use the index the rows containing severity_id = 8 will never be returned by the query even if they exist. So by using the index you only scan 100 k rows and never touch the remaining ~200k. However this query can't be answered without scanning all ~300k records. This means, that on the same database you would get two different results for the exact same query if you decided to drop the index after the first run. The database can't fall back and change the plan during execution.

Perhaps I misunderstood you originally. I thought you suggested that the database should be able to know that it can still use the index because currently the statistics claim that there are no records that would make the result incorrect. You are of course correct, that the statistics are there to guide the planner choices and that is how they are used within PostgreSQL - however some plans will give different results if your assumption about data are wrong.

Because the database has to decided whether or not to use the index. If it decides to use the index, and there are values above 7, then it will misbehave (the query will miss those results). Now of course the database could then scan the rows for values above 7 it missed but at that point there's no point in using the index and you might as well have row scanned for the original query.

As a result, the database has to be 100% sure that there are no values _at all_ above 7 to safely and efficiently use the index, ex. when there's a constraint.