Hacker News new | ask | show | jobs
by feike 2487 days ago
Good set of tips, however "Not NOT IN" is not true in general, as a single counterexample:

    EXPLAIN (costs off)
    SELECT relname, relpages
      FROM pg_class
     WHERE relname IN ('pg_class');

                         QUERY PLAN
  ---------------------------------------------------------
   Index Scan using pg_class_relname_nsp_index on pg_class
     Index Cond: (relname = 'pg_class'::name)
  (2 rows)

In general, I'd like to echo other's comments, learning how to use EXPLAIN and how to interpret query plans is the single most important tool, as it allows you to verify your hypothesis instead of relying on rules-of-thumb.
1 comments

I actually had situation where the using IN resulted in a query that run for 35s and after replacing with `ANY(VALUES(...` it took ~600ms.

In my case `column IN ('value1', 'value2')` was converted to an array like this: `((column)::text ANY ('{value1,value2}'::text[]))`, an index on the column was completely ignored and PG performed a sequential scan.

After changing it to `column = ANY (VALUES ('value1'),('value2'))` it created a table in memory and used it when performing an index scan.

I have feeling this is maybe some kind of a bug in query planner? This is on PG 9.6.8.

I think I'd need more information to diagnose; were there any NULL values in the list? How many values?
The column can have nulls but after a quick check there's only a single record with a null value (probably added as a test, no idea).

In the bad plan a sequential scan was performed and then it removed over 4 million rows (Rows Removed by Filter: 4562849)

Edit: here is the query planner's output: https://explain.depesz.com/s/eVXI

After rewriting the query (change IN into ANY(VALUES( and adding an index to the "seven" table): https://explain.depesz.com/s/b4kU

I think adding the index to the “seven” table is probably the more meaningful change.

You said the index was ignored, but based on that plan output the planner made the right choice since the filter on the “sierra_zulu” table (only one there with an ANY so I assume that’s the one) matches over 90% of the rows. An index scan on that would be far worse.

Adding an index on a different table allowed the driving side of the join to be different, and, with already prefiltered data allowed the other index to be useful as well (fewer index lookups). I’d bet with that index addition your IN clause works just fine.

If you look at sierra_zulu (line #3 on both plans) it took almost 22 seconds initially and after query rewrite (and no index change) took 713ms.

So that change of query alone reduced query time from 35s to ~5s, after that the seven table become the bottleneck and adding an index there reduced the whole query down to 750ms.

I should have saved the intermediate plan to illustrate it, but it essentially looked same as the second plan, except there was a sequence scan on seven table.