Hacker News new | ask | show | jobs
by Drakar1903 2247 days ago
I will be very glad if you actually answer these questions, in a separate comment. I'm driven to write this by nothing but the desire to know.
2 comments

The answer to the performance-related ones is simple. It always depends on the your query, your indexes, and your data. The same query can produce wildly different query plans if you have different data (or even if you have the same data, but the engine decided to sample different rows!)

Best thing you can do is to learn how to read EXPLAIN ANALYZE results.

> - When to use JOIN vs a subquery?

Normally, subqueries return a single value whereas joins can result in n rows of output for 1 row being joined on, and you can access all the columns of those n rows. There are ways to make use of more than one value (e.g. (tuple) IN (subquery)) but if you want to SELECT more than one value, you need to join.

Depending on the database, it might be slower to do a correlated subquery than a join though (MySQL especially).

> - When is a subquery actually a correlated subquery? Will this destroy your performance? Or is it a critical feature?

A subquery is a correlated subquery when it references symbols from the outer query. That means it needs to be evaluated once per row, and can't be evaluated once at the start of query execution. It can destroy performance if it needs to be evaluated too often - if it's in your 'where' clause and is evaluated over too many rows, e.g. it's mixed in with a boolean expression that can't be short cut.

> - Should you put constraints in the JOIN or in the WHERE? Will the distinction drastically affect performance?

Conventionally, you should put equi-join constraints (equality expressions with foreign / primary keys in other tables) in the JOIN clause and other constraints in the WHERE clause. For inner joins, it doesn't make a difference where you put the predicate, semantically. There is a semantic difference for outer joins though (left join, right join, full outer join): failure to join results in a tuple worth of null values from one or both sides (left/right vs full), rather than eliminating the row.

Where the semantics aren't different, performance should not be affected. Of course the database engine might be stupid, but a fundamental requirement of a reasonable query planner is in determining (a) join order and (b) which indexes to use for the combination of join predicate and where predicate. No query planner worth its salt won't consider using the where clause along with the ON clause on the JOIN when fetching rows in the joined table.

> - When do you use WHERE vs HAVING?

WHERE is before GROUP BY and filters the rows that enter aggregation (if any), HAVING comes after GROUP BY and filters the aggregated rows. If you use a derived table (a nested query with a table alias), then you can use WHERE instead of HAVING for no semantic difference, but derived tables may execute differently (MySQL will generally materialize them, PostgreSQL will see through them).

> - Is the NULL from the join because no joined row was found, or because the joined row had a NULL value itself?

If the column is nullable, and you used an outer join, you can't tell. Normally you check for the primary key or some other non-nullable column to discover if a join failed (most often used in anti-join, when you want to find all rows that don't have corresponding rows in the join).