| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by charukiewicz 1913 days ago

Author here. Between this and your other response, where you expound on the same point, I think you're being far too hand wavy about what causes performance issues. The number of joins alone doesn't have much to do with the performance characteristics of any query.

What's more important for performance of queries on larger data sets than the number of joins is that there are indexes and that the query is written in a way that can utilize indexes to avoid multiplicative slowdowns. The reason the UNION query is fast is because the query on either side effectively utilizes the indexes so that the database engine can limit the number of rows immediately, rather than filter them out after multiplying them all together. I can expand this schema to have a UNION query with two 10-table joins and it would still perform better than the 7 table query.

I think someone new to SQL is likely to read your statement and think "okay joins are slow so I guess I should avoid joins". This is not true and this belief that joins are slow leads people down the path that ends at "SQL just doesn't scale" and "let's build a complicated Redis-backed caching layer".

SQL performance is a complex topic. The point of our post was to illustrate that a UNION query can simplify how your join your tables and allow you to write constituent queries that have better performance characteristics. Morphing this into "the number of joins is smaller so the performance is better" is just incorrect.

3 comments

kroolik 1913 days ago

I think it's good to note in the article that the second query is slow because the RDBMS doesn't use indexes (and which ones). Currently, the text is hand waving the problem and moves on.

If the article had, instead, listed indexes, shown they were used in simple cases, shown they weren't used in the second query, dug into why they weren't (maybe they were but it was still hella slow) - that would be a ton of value!

link

slt2021 1913 days ago

Sorry, if I was hand-wavy. I was trying to give other people a simple framework that I use myself (Big O calculated as a number of rows in each table). ALthough I dont know how many rows you have in each table, Big O frameworks still works, because cartesian of tables is being dominated by two huge different datasets (customers's orders being joined to employees' orders). The Union simply calcualtes them separately, rather than doing cartesian of two completely different datasets that represent different entities.

Well written predicates and indexes can help, as well as poorly written predicates make it worse. So there is balance. This is not a shortcut or a silver bullet, it is a trade-off being made. More indexes->faster selects and slower updates/inserts. One bad index=>failed insert and possible losing customer data (happened to me once)

I agree with you that "SQL performance is a complex topic." and one should definitely study query Execution Plan to understand the bottlenecks and make optimization decisions

link

aidos 1913 days ago

Sql queries never really adhere to that simple Big O analysis though. Sure, if you had zero indexes so every operation was nested sequential scans, then it’s Cartesian, but that’s never the case. Most often you get really fast index lookups, by design.

link

slt2021 1912 days ago

I invite you to share execution plans of queries #2 and #4 with the public so that people can decide what is actually slowing down, whether itnis realy cartesian or anything else

link