
Go read his code. He's specifying a bunch of fields like this:

    'zone_2': 0,
    'zone_3': 0,
    ...

My SQL query already handles that, simply by making `zone` one of the columns. It seems I missed `shipment_type`, so here it is with that added:

    SELECT date, carrier, zone, shipment_type, COUNT(id), SUM(price)
           FROM shipments GROUP BY date, carrier, zone, shipment_type;

That replaces another 7 of his lines (`next_day_air: 0, ...`). It's also future-proof: if a new shipment type or zone is added, the bog-standard SQL code still works.
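To make the future-proofing point concrete, here is a minimal sketch that runs the query above against an in-memory SQLite database. The table layout, the sample rows, and the `summarize` helper are my own assumptions for illustration (the real schema may differ); I only added an `ORDER BY` so the output is deterministic.

```python
import sqlite3

# The query from the comment, plus an ORDER BY for stable output.
QUERY = """
    SELECT date, carrier, zone, shipment_type, COUNT(id), SUM(price)
    FROM shipments
    GROUP BY date, carrier, zone, shipment_type
    ORDER BY date, carrier, zone, shipment_type
"""


def summarize(rows):
    """Load rows into a throwaway in-memory table and run the GROUP BY."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: only the columns the query actually touches.
    conn.execute(
        "CREATE TABLE shipments ("
        " id INTEGER PRIMARY KEY, date TEXT, carrier TEXT,"
        " zone INTEGER, shipment_type TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO shipments (date, carrier, zone, shipment_type, price)"
        " VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


rows = [
    ("2014-01-01", "ups", 2, "ground", 5.0),
    ("2014-01-01", "ups", 2, "ground", 7.0),
    ("2014-01-01", "ups", 3, "next_day_air", 20.0),
    # Made-up zone and shipment type that no hard-coded field list knows about.
    ("2014-01-01", "ups", 9, "drone", 50.0),
]

for row in summarize(rows):
    print(row)
```

The zone 9 / `drone` row gets its own group without touching the query, which is the whole point: the set of zones and shipment types lives in the data, not in the code.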

Those fields would need to be in the `CREATE TABLE` statement. The simplified one I wrote was only meant to illustrate that SQL also handles the issue of "We rarely used most of these entities without the other". Or maybe you would join against them, in which case my 3-line solution becomes 4-5 lines.
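For the join variant, a rough sketch of what the 4-5 line version might look like, again using SQLite from Python. The `carriers` lookup table and its `display_name` column are hypothetical, standing in for whatever extra fields you'd actually keep in a separate table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: a hypothetical carriers lookup table next to shipments.
conn.executescript("""
    CREATE TABLE carriers (carrier TEXT PRIMARY KEY, display_name TEXT);
    CREATE TABLE shipments (
        id INTEGER PRIMARY KEY, date TEXT, carrier TEXT,
        zone INTEGER, shipment_type TEXT, price REAL);
    INSERT INTO carriers VALUES ('ups', 'UPS'), ('fedex', 'FedEx');
    INSERT INTO shipments (date, carrier, zone, shipment_type, price) VALUES
        ('2014-01-01', 'ups', 2, 'ground', 5.0),
        ('2014-01-01', 'ups', 2, 'ground', 7.0),
        ('2014-01-01', 'fedex', 3, 'next_day_air', 20.0);
""")

# Same aggregation as before, with one extra JOIN line for the lookup table.
JOINED_QUERY = """
    SELECT s.date, c.display_name, s.zone, s.shipment_type, COUNT(s.id), SUM(s.price)
    FROM shipments s
    JOIN carriers c ON c.carrier = s.carrier
    GROUP BY s.date, c.display_name, s.zone, s.shipment_type
    ORDER BY c.display_name
"""
joined = conn.execute(JOINED_QUERY).fetchall()
for row in joined:
    print(row)
```

The aggregation itself is unchanged; the only added cost is the `JOIN` line and the table aliases.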

It's not for nothing that people like to build SQL on top of Hadoop (see Hive, Impala). SQL queries are nearly always a lot simpler than the comparable MapReduce code. (I find these efforts misguided, but that's a separate issue.)