| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dspillett 268 days ago

One thing to be careful of with the UNION ALL method, is that if you have some rows that match more than one of the clauses in your set of ORs then you will have duplicate results to screen out. This won't happen if you are checking for multiple values in one field, obviously, but is something to be wary of when using this method to optimise kitchen sink queries more generally.

Slapping a DISTINCT in isn't the answer, if it was then you'd just use UNION instead of UNION ALL, because that often makes the query planner call for a lot of excess work to be done. I once found that wrapping and main kitchen sink in a CTE and applying DISTINCT in the SELECT that calls it has the desired effect, but that is risky as it relies on undefined behaviour that may change at a later date and tank your performance. If the number of rows being returned is never going to be large, and your situation allows multiple statements, a safer option is to pull the data into a temporary table and SELECT from that with DISTINCT. Or you could de-dupe in the next layer instead of the DB, or just accept dupes if exact cardinality isn't important and your users aren't going to care about the occasional double-row (i.e. a simple quick-search feature).

And, of course, sometimes you want the duplicates, again maybe in the case of a quick search feature (where you split the results into categories and each duplicate of a row is likely in a different category, especially if the categories align with the search clauses that are being ORed).

1 comments

magicalhippo 268 days ago

Good point. In the described case the OR terms were guaranteed to be disjoint, in a way the query planner could easily figure out. Which Sybase's planner did.

However for cases like described in the article, you'd need to handle that.

While I like CTEs, I've had more consistent luck with subqueries. They also compose more easily.

link