Hacker News new | ask | show | jobs
by yummyfajitas 4572 days ago
The "no migrations" claim is a lie. Migrations still exist, they just happen at the time of reading:

    def parseMongoRow(jsonBlob):
        x = None
        if jsonBlob['date'] < 2012/12/02:
            x = jsonBlob['foo']['bar']
        else if jsonBlob['date'] < 2013/06/01:
            x = mergeFields(jsonBlob['bar'], jsonBlob['baz'])
        else:
            x = jsonBlob['x']
        ...
For expiring fields in a redis database, this isn't a big deal. Your code stinks between [time of migration, time of migration + ttl]. For a permanent datastore, ouch.

I also do not understand why they are using MongoDB for this. They describe a schema involving shipments, shippoints and orders, and one of these things does not occur without the other (an attempt at justifying the document based data model?). I.e., something like this:

    CREATE TABLE orders (...)

    CREATE TABLE shipments (
      order_id BIGINT REFERENCES orders(id) NOT NULL,
      ship_from_id BIGINT REFERENCES shippoints(id) NOT NULL,
      ship_to_id BIGINT REFERENCES shippoints(id) NOT NULL
    )
You can't have a shipment without a shippoint. Black magic!

They have several hundred shipments/day, and lets be generous and assume there are 100 updates/notes to a shipment. We are talking maybe 50,000 inserts/day (omfg big data, invest in a 1TB hard disk!) and it sounds like data that's considerably more important than an ad impressions or pageviews. (Well actually it fits in ram, so maybe the 1TB hard disk on a dedicated server is overkill.)

Also, consider the daily aggregate generator in 3 lines of SQL rather than 236 lines of javascript:

    SELECT date, carrier, zone, COUNT(id), SUM(price)
          FROM shipments
          GROUP BY date, carrier, zone;
I don't get it. How is mongodb even remotely the right tool for this job?
3 comments

Mongo also stores the same column/attribute names with every single row. 50000 inserts? 50000 (usually identical) sets of attribute names stored (compared to once for SQL). And to my knowledge they are stored uncompressed. For a Big Data database, it doesn't seem very good at efficiently storing big data.
This could be a problem, for sure. There are some tools on top of MongoDB that tries to reduce it, for example mongoengine will allow you to define a "compressed name" to store in the DB, meaning that you'll see 'timestamp' on your codebase, but that will be stored as 't'.
I've seen similar problems with systems that have supported multiple versions of the same XML schema at the same time - rather than taking the hit of migrating existing data to new schemas you end up having every more complex application code that has to check where a particular value may be stored (because nobody, in my experience, every bothers to store what version of a schema a particular document used).

This is particularly bad if these changes are small and incremental as each one in isolation looks reasonable but the cumulative effect over a few years can be a complete nightmare.

Condescending much?

Yes, migration happen at read time (translating on the fly may be a possibility without rewriting old documents as well)

Of course doing this is easier than adding manually columns every time you need to change something. PGSQL fans love to say how this works wonderfully in theory but in practice it's not as pretty as it looks like.

"SELECT date, carrier, zone, COUNT(id), SUM(price) FROM shipments GROUP BY date, carrier, zone;"

Nice try, but no.

This select doesn't even match your proposed CREATE TABLE above. It's at least missing a JOIN.

And the 236 lines of JS, well, there's a lot of specifying fields, (22 of them), so, if you do that on SQL you'll end up with lots of lines as well or at least an ugly to read line.

"I don't get it. How is mongodb even remotely the right tool for this job?"

Of course, it's hard to find someone who uses MongoDB respecting its limitations and using it appropriately, most MongoDB hating posts are similar to "I'm trying to attach a trailer to my motorbike and it doesn't work, I also tried taking it onto a lake and it broke, this sucks"

Go read his code. He's specifying a bunch of fields like this:

    'zone_2': 0,
    'zone_3': 0,
    ...
My SQL query already handles that simply by making `zone` one of the columns. Seems I missed the `shipment_type`:

    SELECT date, carrier, zone, shipment_type, COUNT(id), SUM(price)
           FROM shipments GROUP BY date, carrier, zone, shipment_type;
That replaces another 7 of his lines (`next_day_air: 0, ...`). It's also future proof - if a new shipment type or zone is added, the bog standard SQL code still works.

Those fields would need to be on the create table. The simplified create table I wrote was simply to illustrate the fact that SQL also handles the issue of "We rarely used most of these entities without the other". Or maybe you would join against them, in which case my 3 line solution becomes 4-5 lines.

It's not for nothing that people like to build SQL on top of Hadoop (see Hive, Impala). SQL queries are nearly always a lot simpler than comparable MapReduce code. (I find these efforts misguided, but separate issue.)