Hacker News new | ask | show | jobs
by 1500100900 1483 days ago
SELECT DISTINCT is rarely the best idea. In most cases it means someone is using joins to filter out rows that should have been removed with EXISTS (a semi-join).
3 comments

Interesting, I hadn't thought of it that way. Looking through our codebase:

- We get a lot of mileage out of DISTINCT ON to get the most recent version of a row for some subset of columns. I think this a different use case than what you refer to.

- We typically use DISTINCT to get PKs for a table when starting from another table. Like, find all user IDs that commented on a given story by looking at the join table.

    SELECT DISTINCT user_id FROM story_comment WHERE story_id = 4
- "most recent" is usually best done with a max() with an adequate index or, when that's not possible, with a LATERAL JOIN, an ORDER BY and a LIMIT 1

- SELECT id FROM users WHERE EXISTS (SELECT 1 FROM story_comment WHERE story_id = 4 AND story_comment.user_id = users.id)

Can you provide an example of how you'd rewrite a query using SELECT DESTINCT with EXISTS?
As far as I know, nested loop semi joins using EXISTS aren't always capable of using indexes.

Like in the case of a GIN indexed array intersection.