Hacker News new | ask | show | jobs
by supernova87a 1902 days ago
Is there a good place to read from an advanced casual "lay user's" perspective what SQL query optimizers do in the background after you submit the query?

I would love to know, so that I can know what optimizations and WHERE / JOIN conditions I should really be careful about making more efficient, versus others that I don't have to worry because the optimizer will take care of it.

For example, if I'm joining 2 long tables together, should I be very careful to create 2 subtables with restrictive WHERE conditions first, so that it doesn't try to join the whole thing, or is the optimizer taking care of that if lump that query all into one entire join and only WHERE it afterwards? How do you tell what columns are indexed and inexpensive to query frequently, and which are not? Is it better to avoid joining on floating point value BETWEEN conditions?

And other questions like this.

2 comments

You basically only need to know one thing to answer all your questions: use EXPLAIN PLAN. Postgresql has "explain analyze", which is even better than simple "explain", but all SQL databases have "explain", because they are kinda useless without it. The database will tell you what it's going to do (or what it did) and you will decide whether that's ok or whether it's doing something stupid (e.g. full table scan when only 1% of rows is needed), and then you can try things to get the plan that you want (ensuring statistics are up to date, adding indexes, changing the query, etc).

Databases have ways to query the schema which includes the index definitions, so you can know which columns and indexed (and the order of the columns in those indexes).

Unless you materialize a temporary table or materialized view or use a CTE with a planner that doesn't look inside CTEs, the planner will just "inline" your subqueries (what are "subtables"?) and it will not affect the way the join is performed.

Join on floating point value is quite rare. Why do you need to do that?

>Join on floating point value is quite rare. Why do you need to do that?

Ah, thanks for noticing this. They are, for example, (1) tables of timestamped events, and (2) tables of time ranges in which those events need to be associated with (but which unfortunately were not created with that in mind at the time)...

So for example FROM tableA LEFT JOIN tableB ON (timestampA BETWEEN timestampB1 AND timestampB2)

(and where the timestamps can be either floating point or integer nanoseconds)

Since it's a left join, you will get all the rows from tableA, and for each row the matching rows in tableB. If the ranges in tableB are non-overlapping, maybe you have names for time ranges and you want the name of the time range for each row in tableA?

If tableB is large, I don't know what any particular query planner will do with such a query and whether an index on (timestampB1, timestampB2) will help. It should, but use "explain" to check. If tableB has many rows and also has many columns and you only need a few columns, a covering index on (timestampB1, timestampB2) that only has the columns you need can improve perf a lot, because it won't need to refer to tableB itself.

If you use this construction to translate timestamp ranges into calendar ranges, your database might have a function to do that efficiently (convert unix timestamp into datetime, extract year/month/day/etc from the datatime). Or you might need to write a user defined function to do that, in whatever way your database allows (even C). This should be better than a join, IMO.

One alternative rewriting of your query which you maybe did not think of, and which might be crazy or might be plausible, is to use a case statement in the select part, instead of a join. Basically use the info in tableB to generate the SQL for a computed column. If tableB has many rows, this might be worse than a join.

If you want to use "names" from tableB to filter rows in tableA (inner join), and the query should result in a small proportion of the rows from tableA, an index on timestampA is needed. If tableA is really large, it might need to be partitioned on timestampA to filter out whole partitions, but only if you regularly query in such a way that whole partitions can be filtered out at query planning time.

Thanks for that!
My experience is with Postgres, this might vary for other databases. As already said, using EXPLAIN ANALYZE is very useful to see what the planner is doing. This might be hard to read for more complex queries, but it is quite understandable for simple ones.

One of the more important parts is simply understanding which indexes can be used in a query. The other part is understanding when the database will intentionally not use an index, this is mostly related to column statistics. The basics of indexes are pretty simple, but then there is a whole bunch of subtle details that can mean the index can't actually be used for your query.

Another useful part to understand is how much IO a query requires, EXPLAIN (ANALYZE, BUFFERS) is helpful for that. But you also need to understand a bit the layout Postgres uses to store data, how it is stored in pages, TOAST and related stuff.

For Postgres I'd really start with reading the manual on index types and on the statistics collector. After that I'd just play with explain analyze for queries you're writing.

The order of JOINS is optimized automatically in Postgres, but only up to a pointf, for a large number of joins it has to fall back to heuristics.

This is a good explanation about join order optimization: https://www.sqlite.org/queryplanner-ng.html

No database can find perfect join order when you have more than about 8 to 10 tables in the join.