Hacker News new | ask | show | jobs
by Svip 2406 days ago
While counting columns will not include NULL columns, how about counting joined tables?

  SELECT a.id, COUNT(b.*) FROM a JOIN b ON b.a_id = a.id GROUP BY a.id
is not permitted in Postgres.

Sure, I could just use COUNT(b.a_id) since that's what I join on, but a more complicated example might not allow for that. For instance if it was a virtual table.

2 comments

I'm sorry, if you want to count NULL-rows in b.* , how will it ever be different from just COUNT( * )? Maybe I'm misunderstanding what you're after?
You're right, it's a bad example, imagine if I joining two tables:

  SELECT a.id, COUNT(b.*), COUNT(c.*)
    FROM a JOIN b ON b.a_id = a.id JOIN c ON c.a_id = a.id
    GROUP BY a.id
I want to know how many occurrences a_id has in both table b and c. Again in this simple example, I could just count on b.a_id and c.a_id, respectively, but imagine if b and c were complex virtual tables:

  JOIN (SELECT NULL AS foo, 1 AS bar 
        UNION SELECT 1 AS foo, NULL AS bar) b ON b.foo = a.id OR b.bar = a.id
This would be useful if we are aggregating data together, where essentially, there are two ways to join the data with the main table, and both columns can be null.

Of course, in this example, you could count by going COUNT(b.foo) + COUNT(b.bar), but that's a bit awkward, or a column in table b you know to never be null. But what if you don't? And still have table c next to it?

Yes, in all cases, there would be a way out. In the extreme case, you could wrap it in a virtual table, where you add a column that is just always 0 (not null), so you can count on it. It would just be neat if b.* was possible.

I might misunderstand, but I think the join just isn't the right approach for that sort of child record counting, ie. counting records from two or more independent child tables associated with your rows (if that is what you're wanting).

You're grouping and counting after the three way join. That join will involve all combinations of child records between the two child tables associated with any given parent row (almost never what is wanted). So any given non-null thing you're counting from one child record will appear multiple times, = the number of child records in the other table associated with the parent row.

I think you just want to use correlated subqueries to count the child records: select a.id, (select count(whatever) from child1 c1 where c1.a_id = a.id), (select count(whatever) from child2 c2 where c2.a_id = a.id) ...

TLDR: You almost never want to join independent children to a common parent, use independent correlated subquery expressions instead.

Except, that's potentially super slow if the optimiser does not realise what to do. In its default state, it will make two table look ups for each row in table a. So that's 1+N*2 look ups compared to 3 look ups in my example.

For little data, that's probably fine, but for a big database, it will be slow. However, the optimiser may be able to handle that? I know Sybase's and MSSQL's had trouble with it, but I've heard Postgres' might be able to.

First, correctness first. Second, correlated subqueries in selects really aren't noticably slow at all in my experience (Postgres, Oracle) when they're really needed, meaning alternatives are as slow or slower. Especially compared to the alternative of actually doing a cartesian product across independent children of any size. I don't think the optimizer could help much with avoiding that cartesian product to be actually realized, either - the grouping after that join is going to be a big sort in general with results that depend on the distribution of child values on a row-by-row basis. But in any case it would give the wrong answers for both children regardless of speed and memory used!
I guess I didn't make myself clear either.

If there's an inner join, then there is a matching row. It will be counted by a normal COUNT( * ). On an outer join, columns in a 'missing' row will be represented as NULL.

You're saying you wish there were two NULLS, 'missing value in table null' and 'missing row in join null', and that you could count the first one?

No, not that complex. Just that I want to be able to specify which of my joined tables I am counting on in a null-safe way, in case the joining table could have null values in all its columns.
Any reason a regular count() wouldn't work?

SELECT a.id, count() FROM ...

It would, I realised after I read the replies, that I should have used the two joined table scenario as per my reply to your sibling.