Hacker News new | ask | show | jobs
by MarkusWinand 920 days ago
No question, such a query should not be written. That's probably the reason why this odd behavior, which is even different in various DBMSes, is not causing everyday problems.
1 comments

The reason the behavior doesn't cause problems is because everybody treats automatic aggregation as a voodoo where they know one recipe that works and anything different is domain of the Devil.

And, IMO, that's a very sane and reasonable way to treat it. The entire idea of automatic aggregation is flawed, and those queries should just have a `group by ()` explicit at the right place.

I agree, I think the original sin here is the fact that whether a `SELECT` is an aggregation is determined by the contents of the scalar expressions at all. I think most of this weirdness comes directly out of wanting to be able to write both `SELECT sum(x) FROM xx` and `SELECT x FROM xx` and have them work.

Not that I have a better solution offhand, in SQL grouping by a constant value is not actually the same as not writing `GROUP BY` at all since the behaviour on empty tables is different.

What's an aggregation per se? A SQL query is best thought of an arbitrary generator function. You can, using functions like UNNEST, end up emitting multiple row-tuples for each processed input row-tuple. An aggregation is just a generator function that happens to reduce over all the input rows and then emit one row-tuple when there's no more input. Query-planning engines do not special-case this. It's just a generator node like any other generator node.

Consider: using window functions, you can do partial aggregations over subsets of the input — without even necessarily partitioning the input (i.e. you can compute "running totals" and other wacky state-machine-like outputs.)

Not writing `group by` is the same as writing `group by ()`.

And yeah, the difference between that and a value is one of those really surprising things on SQL that actually make sense and should be this way. Unfortunately, there are many of those.

Ah my apologies, I wasn't familiar with that syntax.
No need to apologize. My post didn't make that clear at all.
Would you argue that automatic scalar-ism is also flawed, and the query

    SELECT a FROM aa;
should have an explicit grouping, like

    SELECT a FROM aa GROUP BY id;
?

After all, when you think about it, it's really not the aggregate functions that break expectations, it's the scalars. Of the four combinations of having or not having an explicit `GROUP BY` and having or not having an aggregate function, three of them have the unsurprising behavior of returning exactly one row per grouping.

    -- aggregate function and group by - one single result per grouping
    SELECT sum(a) FROM aa GROUP BY a; --or
    SELECT sum(a) FROM aa GROUP BY ();


    -- no aggregate function, but a group by - one single result per grouping
    SELECT a FROM aa GROUP BY a;

    -- just aggregate function, but no group by - one single result per (single, implicit) grouping
    SELECT sum(a) FROM aa;

But then, when you have neither an aggregate function nor an explicit `GROUP BY`, it breaks expectations:

    -- no aggregate function, no group by - one result row per row 
    -- in the source set, even though there should be only one big implicit group
    SELECT a FROM aa;
Therefore, I propose that the next SQL standard should introduce a new `GROUP BY INDIVIDUAL ROW` keyword, that henceforth all "scalar" queries MUST use in order to have consistent behaviour with the rest of the language.
What is flawed is not that there is an implicit grouping on all queries. It's that the implicit grouping changes, depending on a set of rules that consider stuff written in several places that are not explicitly related to it.

You are asking if it makes sense to have an implicit grouping at all; it very obviously absolutely does. And grouping by individual row is the very obvious default. But I do totally support adding that keyword expressing the default. All defaults should be expressible.