Hacker News new | ask | show | jobs
by Glyptodon 268 days ago
If optimization is really as simple as applying De Morgan's laws, surely it could be done within the query planner if that really is the main optimization switch? Or am I misreading this somehow?

Edit: I guess the main difference is that it's just calculating separate sets and then merging them, which isn't really DeMorgan's, but a calculation approach.

5 comments

In most cases, the end result is a lot messier than it looks in this minimal example. Consider a case where the query is a 12-way join with a large IN list somewhere (which is essentially an OR); would it still look as attractive to duplicate that 12-way join a thousand times?

There _are_ optimizers that try to represent this kind of AND/OR expression as sets of distinct ranges in order to at least be able to consider whether a thousand different index scans are worth it or not, with rather limited success (it doesn't integrate all that well when you get more than one table, and getting cardinalities for that many index scans can be pretty expensive).

I'm not seeing De Morgan. I am seeing inclusion exclusion, which is just a neat trick all round. Highly recommend remembering it.

I imagine negative filters to be a bit inefficient as well, though maybe not for a simple count.

Inclusion-exclusion is just the generalization of De Morgan's law for more than 2 sets.

And the example shows exactly two sets.

So it's exactly De Morgan's law.

The core issue the article is pointing to is that most database indexes are B-trees, so if you have a predicate on the form (col_a = 'foo' OR col_b = 'foo'), then it is impossible to use a single B-tree lookup to find all rows that match the predicate. You'd have to do two lookups and then merge the sets. Some query optimizers can do that, or at least things that are similar in spirit (e.g. Postgres bitmap index scan), but it's much more expensive than a regular index lookup.
> but it's much more expensive than a regular index lookup.

It doesn't have to be, it just "is" in some database engines for various historical reasons.

I.e.: PostgreSQL 18 is the first version to support "B-tree Skip Scan" operators: https://neon.com/postgresql/postgresql-18/skip-scan-btree

Other database engines are capable of this kind of thing to various degrees.

The linked article’s optimization applies to compound index queries, not “OR” condition optimization.

Unrelated or not, skip-scan will be useful in some cases. However, the cases where it adds noticeable benefit are the cases where a separate index should have been used for leading columns anyway (and in memory-constrained/frequently-cold-cache situations, a separate index might even be faster). If you can’t even begin to guess at the order of magnitude of cardinality, or if your leading-column-lacking queries are quite rare and not that perf-sensitive on a big table exerting index cache pressure, then skip scans make sense.

Skip-scan or similar code can solve the problem if you have a compound index on both columns.

The query planner can include the entire matching range for the 'A' column and check the 'B' column for matches via skip-scan.

This is possible in principle, but I don't believe most (any?) database engines use this specific approach.

It could be optimal if the results are required in A,B sorted order.

B-tree skip scan does not address the problem in the article at all.
SQLite3 turns ORs into UNIONs.
I think creating alternative plans is easy, knowing which one will be the fastest is the hard part.