by CodeIsTheEnd 1407 days ago
I had never heard of GROUP BY CUBE either! It looks like it's part of a family of special GROUP BY operators—GROUPING SETS, CUBE, and ROLLUP—that basically issue the same query multiple times with different GROUP BY expressions and UNION the results together.

Using GROUP BY CUBE(a, b, c, ...) creates GROUP BY expressions for every element in the power set of {a, b, c, ...}, so GROUP BY CUBE(a, b) does separate GROUP BYs for (a, b), (a), (b) and ().
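
To make the power-set expansion concrete, here is a minimal Python sketch that spells out CUBE(a, b) as a UNION ALL of ordinary GROUP BY queries and runs it on an in-memory SQLite database, which as far as I know has no CUBE of its own. The `sales` table, its `amount` column, and both helper names are made up for illustration. A real engine would plan this far more efficiently than four separate scans.

```python
import sqlite3
from itertools import combinations


def cube_grouping_sets(columns):
    """Every subset of `columns`, largest first: the power set CUBE expands to."""
    cols = list(columns)
    return [
        subset
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]


def cube_as_union_sql(table, columns, aggregate):
    """Build one SELECT per grouping set and glue them with UNION ALL.

    Identifiers are interpolated directly, so only pass trusted names.
    """
    parts = []
    for grouping in cube_grouping_sets(columns):
        # Columns left out of this grouping set are reported as NULL.
        select_list = ", ".join(
            c if c in grouping else f"NULL AS {c}" for c in columns
        )
        group_by = f" GROUP BY {', '.join(grouping)}" if grouping else ""
        parts.append(f"SELECT {select_list}, {aggregate} FROM {table}{group_by}")
    return "\nUNION ALL\n".join(parts)


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (a TEXT, b TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("x", "p", 1), ("x", "q", 2), ("y", "p", 4)],
)

sql = cube_as_union_sql("sales", ["a", "b"], "SUM(amount) AS total")
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

The NULLs in the output mark which columns were aggregated away, the same convention the real operators use.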

It's like SQL's version of a pivot table, returning aggregations of data filtered along multiple dimensions, and then also the aggregations of those aggregations.

It seems like it's well supported by Postgres [1], SQL Server [2] and Oracle [3], but MySQL only has partial support for ROLLUP with a different syntax [4].

[1]: https://www.postgresql.org/docs/current/queries-table-expres...

[2]: https://docs.microsoft.com/en-us/sql/t-sql/queries/select-gr...

[3]: https://oracle-base.com/articles/misc/rollup-cube-grouping-f...

[4]: https://dev.mysql.com/doc/refman/8.0/en/group-by-modifiers.h...

4 comments

I would gladly buy a book of "SQL Recipes" ranging from beginner-level to advanced stuff that uses features like this, ideally with coverage of at least a few popular database systems, but at minimum Postgres.

Is there such a book?

In fact, Yugabyte is giving it away for free - https://downloads.yugabyte.com/marketing-assets/O-Reilly-SQL...
This is great! Thanks for the link. :)
Joe Celko has a number of SQL for Smarties books I've been meaning to look through.
These are phenomenal books. Your understanding of SQL as both a language and a tool will be transformed.

They're a bit dated in places, but that is also helpful, as you learn how a technique works instead of just relying on a vendor-specific extension.

Yes, PostgreSQL (9.5 and later), SQL Server (2008 and later), Oracle, SQL Anywhere and DB2 support grouping sets. Most have had it for more than a decade.

Some also support the MySQL/MariaDB WITH ROLLUP syntax introduced in 2018.

It's not a pivot table equivalent. It's most useful for calculating multiple related aggregates at once for reporting purposes, but ROLLUP doesn't substitute values for columns, i.e. it doesn't pivot results on an axis.

MS SQL Server has native pivot support, and Postgres has a limited emulation of it through the crosstab(...) function. https://stackoverflow.com/a/11751905 https://www.postgresql.org/docs/current/tablefunc.html

For folks just learning about ROLLUP et al, I highly recommend this comparison chart for an overview of major features offered by modern relational databases. https://www.sql-workbench.eu/dbms_comparison.html

There's a whole constellation of advanced features out there that arguably most application developers are largely unaware of. (Which explains why most app devs still treat relational databases like dumb bit buckets at the far end of their ORMs.)

I should amend for clarity: GROUPING SETS are really useful for speeding up complex report generation.
I had a situation recently where I had a huge amount of data stored in a MariaDB database and I wanted to create a dashboard where users could interactively filter subsets and view the data. The naive solution of computing the aggregate statistics directly from the users' filter parameters was too slow; most of the aggregation needed to be done ahead of time and cached. The website's backend code was a spaghetti house of horrors, so I wanted to do as much as possible in the DB. (It was the first time in my career I chose to write more SQL rather than code.)

If I had a fancy DB I could use CUBE or GROUPING SETS and MATERIALIZED VIEWs to easily pre-calculate statistics for every combination of filter parameters, and have them updated automatically when the source data changes. But I had MariaDB, so I made do. I ended up with something like this:

    SELECT ... SUM(ABS(r.ilength)) AS distance, COUNT(*) AS intervals FROM r
    GROUP BY average_retro_bucket, customer, `year`, lane_type, material_type, state, county, district WITH ROLLUP
    HAVING average_retro_bucket IS NOT NULL AND customer IS NOT NULL;
"The WITH ROLLUP modifier adds extra rows to the resultset that represent super-aggregate summaries. The super-aggregated column is represented by a NULL value. Multiple aggregates over different columns will be added if there are multiple GROUP BY columns."

So you can query like this to get stats for all districts in CA->Mendocino county:

    SELECT * FROM stats_table WHERE state = 'CA' AND county = 'Mendocino' AND district IS NULL
or like this to get a single aggregate of all the counties in CA put together:

    SELECT * FROM stats_table WHERE state = 'CA' AND county IS NULL AND district IS NULL
However, unlike CUBE, WITH ROLLUP doesn't create aggregate result sets for each combination of grouping columns. It only rolls up from the right: if one grouping column is a NULL aggregate, all the following ones are too. So if you want to query all the years put together but only in CA, you can't do:

    SELECT * FROM stats_table WHERE year IS NULL AND state = 'CA'
If `year` is null, all the following columns are as well. The solution was to manually implement wildcards before the last filtered group column by combining the rows together in the backend.
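
A rough Python sketch of that kind of backend workaround, under simplifying assumptions: three hypothetical grouping columns (year, state, county) instead of the eight in the real query, a single additive `distance` total, and no genuine NULLs in the data. `rollup` emulates the rows WITH ROLLUP would store, and `total_with_wildcards` is one way the combining could look, not the actual backend code.

```python
COLUMNS = ("year", "state", "county")


def rollup(rows, columns, value):
    """Emulate GROUP BY ... WITH ROLLUP: one total per prefix of `columns`.

    Rolled-up columns are marked with None, mirroring SQL's NULL.
    """
    totals = {}
    for row in rows:
        for n in range(len(columns), -1, -1):
            key = tuple(row[c] for c in columns[:n]) + (None,) * (len(columns) - n)
            totals[key] = totals.get(key, 0) + row[value]
    return totals


def total_with_wildcards(stats, columns, filters):
    """Sum the rollup rows that match `filters` (assumed non-empty).

    Unfiltered columns before the last filtered one act as wildcards:
    we read the rows aggregated exactly through the last filtered column
    and add up every row that matches.
    """
    last = max(columns.index(c) for c in filters)
    total = 0
    for key, value in stats.items():
        # Keep only rows at the right rollup level: concrete up to `last`,
        # rolled up (None) afterwards.
        at_level = all(k is not None for k in key[: last + 1]) and all(
            k is None for k in key[last + 1 :]
        )
        matches = all(key[columns.index(c)] == v for c, v in filters.items())
        if at_level and matches:
            total += value
    return total


# Hypothetical source rows.
source = [
    {"year": 2020, "state": "CA", "county": "Mendocino", "distance": 10},
    {"year": 2020, "state": "CA", "county": "Lake", "distance": 5},
    {"year": 2021, "state": "CA", "county": "Mendocino", "distance": 7},
    {"year": 2021, "state": "OR", "county": "Lane", "distance": 3},
]

stats = rollup(source, COLUMNS, "distance")
ca_all_years = total_with_wildcards(stats, COLUMNS, {"state": "CA"})
print(ca_all_years)  # 22
```

This only works directly for additive aggregates such as SUM and COUNT. Something like an average would have to be recomputed from the combined sums and counts.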

I worked around not having materialized views by creating an EVENT that would re-create the stats tables every night. The stats don't really need to be real-time. Re-writing the multiple-GB statistics tables every night will wear out the SSDs in 20 years or so, oh well.