Hacker News new | ask | show | jobs
by magicalhippo 637 days ago
I'll add some of mine:

Learn your DB server. Check the query plans often. You might get surprised. Tweak and recheck.

Usually EXISTS is faster than IN. Beware that NOT EXISTS behaves differently than EXCEPT in regards to NULL values.

Instead of joining tables and using distinct or similar to filter rows, consiser using subquery "columns", ie in SELECT list. This can be much faster even if you're pulling 10+ values from the same table, even if your database server supports lateral joins. Just make sure the subqueries return at most one row.

Any query that's not a one-off should not perform any table scans. A table scan today can mean an outage tomorrow. Add indexes. Keep in mind GROUP BY clause usually dictates index use.

If you need to filter on expressions, say where a substring is equal something, you can add a computed column and index on that. Alternatively some db's support indexing expressions directly.

Often using UNION ALL can be much faster than using OR, even for non-trivial queries and/or multiple OR clauses.

edit: You can JOIN subqueries. This can be useful to force the filtering order if the DB isn't being clever about the order.

7 comments

The most useful thing is learning your DBMS. There's no escaping the performance and isolation quirks of each one, and there are different bonus features in each.

One interesting thing I found about Postgres that's probably true of others too, often you can manually shard INSERT (SELECT ...) operations to speed them up linearly with the number of CPU cores, even when you have like 10 joins. EXPLAIN first, find the innermost or outermost join, and kick off a separate parallel query operating on each range of rows (id >= start AND id < end). For weird reasons, I relied on this a lot for one job 6 years ago. Postgres has added parallelism in versions 10+, but it's still not this advanced afaik.

It would be cool to see something automating this conversion (and other similar performance workarounds) available as a first-class feature in a SQL IDE.
I ended up building jank Python-based tooling around it to sorta automate it. You select your key and it decides the ranges for you.

Wonder if it'd be useful at all for live applications or just data processing. For the former, would need to somehow execute all reads at the same MVCC version even though they're separate connections.

Instead of joining tables and using distinct or similar to filter rows, consiser using subquery "columns", ie in SELECT list.

What does this mean? Running

    SELECT
      column1,
      (
        SELECT column2, column3, ...
        FROM table_b
        WHERE table_a.id = table_b.a_id
      )
    FROM table_a
Results in "subquery must return only one column" as I expected. You mean returning the multiple columns as a record / composite type?

Keep in mind GROUP BY clause usually dictates index use.

The reason for this wasn't immediately apparent to me. For those who were curious, this blog post walks through it step by step: https://www.brentozar.com/archive/2015/06/indexing-for-group...

> > Keep in mind GROUP BY clause usually dictates index use.

> The reason for this wasn't immediately apparent to me.

The key thing to remember is that grouping is essentially a sorting operation, and it happens before your other sorts (that last part isn't necessarily as obvious).

Sorry, was on mobile so hadn't patience to type examples.

    SELECT
      column1,
      (
        SELECT column2
        FROM table_b
        WHERE table_a.id = table_b.a_id
      ) as b_column2,
      (
        SELECT column3
        FROM table_b
        WHERE table_a.id = table_b.a_id
      ) as b_column3
    FROM table_a
It might look like a lot more work, but in my experience it's usually a lot faster. YMMV but check it.
How well that performs compared to a JOIN can vary massively depending on the data sizes of table_a & tale_b, how table_b is indexed, and what else is going on in the query.

If table_b has an index on id,column2,column3 (or on id INLUDEing column2,column3) I would expect the equivalent JOIN to usually be faster. If you have a clustered index on Id (which is the case more often than not in MS SQL Server and MySQL/InnoDB) then that would count for this unless the table is much wider than those three columns (so the index with its selective data would get many rows per page more than the base data).

Worst (and fairly common) case with sub-selects like that is the query planner deciding to run each subquery one per row from table_a. This is not an issue if you are only returning a few rows, or just one, from table_a, but in more complex examples (perhaps if this fragment is a CTE or view that is joined in a non-sargable manner so filtering predicates can't push down) you might find a lot more rows are processed this way even if few are eventually returned due to other filters.

There are times when the method is definitely faster but be very careful with it (test with realistic data sizes and patterns) because often when it isn't, it really isn't.

> perhaps if this fragment is a CTE or view

Yeah I guess I should have specified that this technique usually works best when done in the outer query, not buried deep inside.

It can be particularly effective if you fetch partial results, ie due to pagination or similar.

That said, these things aren't set in stone. I shared my experience, but my first tip goes first :)

You have to be careful here that a one-to-many relationship doesn't exist and returns more than 1 row -- it'll cause an error and halt your query
Yes, as I noted.

Frequently this is trivial, sometimes it's not.

If there will be multiple hits but it doesn't matter that much, there's the obvious TOP 1 or MIN(col) and such.

It's a tradeoff between accidentally breaking the query and returning unexpected data.

Note that if you used join you could have bigger issues as the join would succeed but now you got multiple rows where you didn't expect.

Are there any tools or tips to help speed up the "which JOIN is duplicating data" hunt?

Usually my biggest problem is getting all the query parameters lined up to reproduce the issue! (Being able to flip on extended logging or a profiler can make this easy.)

Cutting out the result columns when disabling JOINs to narrow it down is straightforward but tracking columns down in WHERE clauses quickly tends not to be.

Good question. Obviously a profiler or similar that can capture the details when it happens helps, as you note.

If you can reproduce the issue then what I tend to do is to include the unique id column from each joined table (we try to avoid natural keys).

If it doesn't have a unique id column I replace the join with a subquery that includes row_number(), so I can se which one that doesn't repeat.

But without being able to replicate, I don't know of any better way than just studying the ON conditions carefully.

Would a cross apply accomplish the same result without the risk of multiple rows?

Cross apply (select top 1 ... ) x

Agreed. Learn how to use EXPLAIN and interpret using whatever tools you prefer. Also monitor your queries. Something I did in a previous start up was install PgHero, and man did that tool help me optimise and prioritise performance.
PgHero looks great (props for linking to related projects!), though somehow not discussed here much previously.

https://github.com/ankane/pghero

https://news.ycombinator.com/item?id=41299148#41300220 (mentioned among the author's other helpful tools and Ruby gems)

> helpful to identify slow queries in production, remove duplicate indexes, see missing indexes, keep an eye on table size, etc

--

I haven't recently put in the effort to find a copy of SQL Sentry from back when the full-featured edition was briefly free but even the "always free" version was helpful working with MSSQL query plans.

https://www.solarwinds.com/free-tools/plan-explorer

(NOTE: Not sure how free-but-pushy it is these days, but years ago it wasn't bad.)

> Learn your DB server. Check the query plans often. You might get surprised. Tweak and recheck.

Oftentimes the well-designed queries behave unexpectedly, because the column statistics are not updated or when the data is fragmented for big tables (e.g. random PK insertion).

Sounds like that DBMS would work better with serial int PKs
> Any query that's not a one-off should not perform any table scans. A table scan today can mean an outage tomorrow

I disagree.

There are queries where a table scan is the most efficient access strategy. These are typically analytical/aggregation queries that usually query the whole table. And sometimes getting only 50% of all rows is better done using a table scan as well.

I also don't see how a (read only) "table scan" could leave to an outage. It won't block concurrent access. The only drawback is that it results in a higher I/O load - but if the server can't handle that, it would assume it's massively undersized.

I mentioned in a different reply that I did not have analytic queries in mind. I don't work with that so forgot to specify.

Outage might "just" mean slow enough that customer can't get their work done in time. For the customer it's the same.

> Just make sure the subqueries return at most one row.

The JSON functions most RDBMS offer are awesome for that. One subquery to get a JSON if you have multiple results for the field then you only have to decode it on the app side.

> Any query that's not a one-off should not perform any table scans. A table scan today can mean an outage tomorrow.

That very much depends on your data.

I should have noted that I was talking about application workloads. I don't have much experience with analytics workloads.

If you have something else in mind, do feel free to elaborate.

Relevant for applications as well, when a table only has a few thousand entries, a scan is not the end of the world and not even an outage in waiting.

I agree with you that one should seek when possible as part of normal query optimization, but depending on your data, it could also just easily be something you can live with forever.

You can’t control the growth rate of your tables, you can only estimate it. When we design for reliability we want to remove single cause failures and make a best effort to reduce dual cause failures. We definitely don’t want two failures from a single cause.

What reason might the lack of indexes suddenly become a critical issue? And what other things might you be scrambling to deal with at the same time? Tables might fill quickly when a favorable review comes in, or some world even results in churn in your system.

Just make the damned index. You Are Going to Need It. And what’s the harm in making it?

Depending on your application, you can very accurately estimate it. And in a case I had yesterday, it involved stringy numbers because of a third party system, so I could indeed add a computed persisted column that converts our number to a VARCHAR, add a 9th index with a lot of fields on that computed column and then save… almost nothing compared to just scanning 6k rows.
We had one such table for years. The one day I get an emergency call from support, big customer don't get their responses and it's critical for their workflow.

After some digging I found the service generating the responses got killed due to being unresponsive.

Turns out our customer got a new client which caused them to suddenly generate 100x as much data as others in this module. And that caused a lot more data in a table that joined this non-indexed table.

So everything was working, it was just the performance went over a cliff in a matter of days due to the missing index.

Added the required index and it's been humming ever since.

I've had similar experiences, and so these days I'm very liberal with indexes.

We have read-heavy workloads, if you mostly insert then sure be conservative.

“And what’s the harm in making it?“

Increased storage and slower inserts?

In a table you think isn’t growing?? No.
As usual, there are well-qualified exceptions. If you are very certain the table scan can't hurt, sure. But in my experience, an index wouldn't hurt any in those cases.