Hacker News new | ask | show | jobs
by wolf550e 1903 days ago
You basically only need to know one thing to answer all your questions: use EXPLAIN PLAN. Postgresql has "explain analyze", which is even better than simple "explain", but all SQL databases have "explain", because they are kinda useless without it. The database will tell you what it's going to do (or what it did) and you will decide whether that's ok or whether it's doing something stupid (e.g. full table scan when only 1% of rows is needed), and then you can try things to get the plan that you want (ensuring statistics are up to date, adding indexes, changing the query, etc).

Databases have ways to query the schema which includes the index definitions, so you can know which columns and indexed (and the order of the columns in those indexes).

Unless you materialize a temporary table or materialized view or use a CTE with a planner that doesn't look inside CTEs, the planner will just "inline" your subqueries (what are "subtables"?) and it will not affect the way the join is performed.

Join on floating point value is quite rare. Why do you need to do that?

1 comments

>Join on floating point value is quite rare. Why do you need to do that?

Ah, thanks for noticing this. They are, for example, (1) tables of timestamped events, and (2) tables of time ranges in which those events need to be associated with (but which unfortunately were not created with that in mind at the time)...

So for example FROM tableA LEFT JOIN tableB ON (timestampA BETWEEN timestampB1 AND timestampB2)

(and where the timestamps can be either floating point or integer nanoseconds)

Since it's a left join, you will get all the rows from tableA, and for each row the matching rows in tableB. If the ranges in tableB are non-overlapping, maybe you have names for time ranges and you want the name of the time range for each row in tableA?

If tableB is large, I don't know what any particular query planner will do with such a query and whether an index on (timestampB1, timestampB2) will help. It should, but use "explain" to check. If tableB has many rows and also has many columns and you only need a few columns, a covering index on (timestampB1, timestampB2) that only has the columns you need can improve perf a lot, because it won't need to refer to tableB itself.

If you use this construction to translate timestamp ranges into calendar ranges, your database might have a function to do that efficiently (convert unix timestamp into datetime, extract year/month/day/etc from the datatime). Or you might need to write a user defined function to do that, in whatever way your database allows (even C). This should be better than a join, IMO.

One alternative rewriting of your query which you maybe did not think of, and which might be crazy or might be plausible, is to use a case statement in the select part, instead of a join. Basically use the info in tableB to generate the SQL for a computed column. If tableB has many rows, this might be worse than a join.

If you want to use "names" from tableB to filter rows in tableA (inner join), and the query should result in a small proportion of the rows from tableA, an index on timestampA is needed. If tableA is really large, it might need to be partitioned on timestampA to filter out whole partitions, but only if you regularly query in such a way that whole partitions can be filtered out at query planning time.

Thanks for that!