Hacker News new | ask | show | jobs
by magicalhippo 637 days ago
Sorry, was on mobile so hadn't patience to type examples.

    SELECT
      column1,
      (
        SELECT column2
        FROM table_b
        WHERE table_a.id = table_b.a_id
      ) as b_column2,
      (
        SELECT column3
        FROM table_b
        WHERE table_a.id = table_b.a_id
      ) as b_column3
    FROM table_a
It might look like a lot more work, but in my experience it's usually a lot faster. YMMV but check it.
3 comments

How well that performs compared to a JOIN can vary massively depending on the data sizes of table_a & tale_b, how table_b is indexed, and what else is going on in the query.

If table_b has an index on id,column2,column3 (or on id INLUDEing column2,column3) I would expect the equivalent JOIN to usually be faster. If you have a clustered index on Id (which is the case more often than not in MS SQL Server and MySQL/InnoDB) then that would count for this unless the table is much wider than those three columns (so the index with its selective data would get many rows per page more than the base data).

Worst (and fairly common) case with sub-selects like that is the query planner deciding to run each subquery one per row from table_a. This is not an issue if you are only returning a few rows, or just one, from table_a, but in more complex examples (perhaps if this fragment is a CTE or view that is joined in a non-sargable manner so filtering predicates can't push down) you might find a lot more rows are processed this way even if few are eventually returned due to other filters.

There are times when the method is definitely faster but be very careful with it (test with realistic data sizes and patterns) because often when it isn't, it really isn't.

> perhaps if this fragment is a CTE or view

Yeah I guess I should have specified that this technique usually works best when done in the outer query, not buried deep inside.

It can be particularly effective if you fetch partial results, ie due to pagination or similar.

That said, these things aren't set in stone. I shared my experience, but my first tip goes first :)

You have to be careful here that a one-to-many relationship doesn't exist and returns more than 1 row -- it'll cause an error and halt your query
Yes, as I noted.

Frequently this is trivial, sometimes it's not.

If there will be multiple hits but it doesn't matter that much, there's the obvious TOP 1 or MIN(col) and such.

It's a tradeoff between accidentally breaking the query and returning unexpected data.

Note that if you used join you could have bigger issues as the join would succeed but now you got multiple rows where you didn't expect.

Are there any tools or tips to help speed up the "which JOIN is duplicating data" hunt?

Usually my biggest problem is getting all the query parameters lined up to reproduce the issue! (Being able to flip on extended logging or a profiler can make this easy.)

Cutting out the result columns when disabling JOINs to narrow it down is straightforward but tracking columns down in WHERE clauses quickly tends not to be.

Good question. Obviously a profiler or similar that can capture the details when it happens helps, as you note.

If you can reproduce the issue then what I tend to do is to include the unique id column from each joined table (we try to avoid natural keys).

If it doesn't have a unique id column I replace the join with a subquery that includes row_number(), so I can se which one that doesn't repeat.

But without being able to replicate, I don't know of any better way than just studying the ON conditions carefully.

Would a cross apply accomplish the same result without the risk of multiple rows?

Cross apply (select top 1 ... ) x