Hacker News new | ask | show | jobs
by jrochkind1 1407 days ago
How can querying for PK's than doing a second query on the FK be faster than a join that's semantically equivalent? Postgres optimizer gone terribly wrong or something? Or am I misunderstanding what's going on?

It would never have occurred to me to even try this, I would have assumed postgres would do as well as could be done at what looks like a very standard join. So I guess that's a lesson not to make assumptions?

Not sure, without even having a theory of why pg would behave this way, I'd be worried that as the data changes it might regress again. Or if the phase of the moon changes or something -- since it seems like magic!

6 comments

It’s almost never faster to do two queries. But DBs are complicated query engines with sophisticated compilers, so you have to go in and try to convince the compiler to optimize the way you want that will make the query fast. EXPLAIN_ANALYZE is your best friend for understanding what the compiler thinks is best. And then you can adjust the query to convince it that it should do something faster.

In this specific case, I’d bet that having the FROM and JOIN tables reversed would be enough to get even better performance than 2 queries: `SELECT * FROM os JOIN o …`.

It's a failure of the optimiser. AS a DB guy (MSSQL, not postgres), there's always a risk of this, you always have to be aware of this, and don't assume. MSSQL's optimiser is pretty good but I've also seen slowdowns on plain queries after adding an index. Optimisers are complex and individual optimisation rules interact in odd and sometimes counterproductive ways. Also if your stats are not up-to-date, trouble is guaranteed.
You're right. I had the same thought.

Frankly, it's such a simple case that I'm hesitant to pin the problem on Postgres. I'm now inclined to believe that the ORM/query builder they're using (which I'm not familiar with) is generating a query that does something unusual. I'm also having a hard time understanding the code, so it may just be a bad query in the first place. If the generated SQL was shown, it would be much easier to look into.

Generally speaking, if you're just joining some data onto the results of an expensive operation, it's exactly the same as doing a second query, just without the round trip and overhead.

The lesson here is to always look at EXPLAIN ANALYZE any time you're doing Postgres optimization work. There's a wealth of useful information that will often point you to your mistake. Databases are quite often extremely good at doing common-sense operations well.

We secretly agree and aren't sure why it's faster to do two queries, but it measurably is. We're going to try some of the suggestions littered in this conversation and will report back, this time with some EXPLAIN output. We appreciate the suggestions and theories.
While I generally wonder the same question... I've seen basically every database follow this pattern. There are fairly frequent scenarios where doing a couple separate indexed queries performs better than an equivalent join. Particularly when queries get kinda-large and up.

I can certainly see it when running into memory limits per table, because doing the full join might require more memory at once, but it frequently happens far below where I'd expect that to occur (in the low thousands of rows). Dunno why. Maybe it uses caches more efficiently? Many simple operations are often more cache-able than a few complex ones.

Maybe disk spill because some joins require a superlinear amount of space and doesn’t fit in memory buffers?
I’ve seen it happen when people forget to index the foreign key columns, it also may be faster on one to many joins if you need to use DISTINCT to remove duplicate rows (instead of joining on a specific subquery that eliminates the duplication implicitly).