Hacker News
by FooBarWidget 5762 days ago
People should stop looking for silver bullets that are a) "web scale" (intentionally in quotation marks) and b) super secure/durable/consistent/whatever. It's all trade-offs. MongoDB makes sense for some data but not others. Its weaknesses are its strengths and vice versa. Same goes for SQL databases.

We use MongoDB for storing tons and tons of analytics data for which we don't care if some stuff occasionally gets lost in a server crash. The data really fits MongoDB well and it would have been a nightmare if we were to use an SQL database for this. But for bank transactions we wouldn't even consider MongoDB.

The write lock might be a problem for some people. On the other hand MongoDB supports easy sharding, much easier than with SQL. Sharding allows us to scale horizontally which is a huge plus for our data.

4 comments

One of the best things to come out of the NoSQL 'movement' is exactly that, no more silver bullets. As much as I like Mongo, I would never blindly recommend it, or any other data store for that matter. It's all about analyzing what your problem space actually needs, and using the best tool to fill that space.

And, yes, the speed improvements to the ruby driver are very much appreciated :)

My problem is that I'm not a data storage expert. Do I now have to specialize in this field to be able to choose the appropriate persistence technology for a given problem? I have yet to see a good explanation of when different technologies are more appropriate, as most of the discussions I see usually devolve into some sort of SQL-NoSQL flame war. I would like a good, fair resource to explain the pros and cons of different persistence technologies more clearly.
The questions you should ask yourself are the following:

* Does it matter if you lose the last 5 seconds' worth of updates? The last 5 minutes? The last day?

If you can lose 5 seconds' worth of updates, a MongoDB replication pair is just fine. If you can lose a day's worth of updates (or can easily reconstruct the database contents from other sources), you can try out pretty much anything without bad repercussions. If you can't lose anything, you're pretty much limited to the most conservative databases (the SQL bunch).

* What's the most obvious unit of data that you're working with?

If you always update single values (or add things to lists/sets), Redis is an excellent choice. If you have fixed-size records, SQL or one of the table-based options (Cassandra, HBase) may be for you. If you have documents with substantial internal structure, a document store (MongoDB, CouchDB, or Lotus Notes if you want something expensive and commercial) would be a good option.

* How much data do you have?

If all of your data fits into memory (and for the price of another server, you may well get enough memory to fit all of your data), you can go pretty far with a single server. If it fits on a single set of hard disks, you'd want replication, not sharding, so that the risk of losing data is minimal. If your data is much larger than that, your only hope is a sharding setup - either with SQL+spit+glue, or Cassandra/HBase, or some version of MongoDB where sharding is stable enough for production use (I do remember seeing warnings - so the current version may or may not fit that description).
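
For what it's worth, the three questions above can be written down as a rough checklist in code. This is only a sketch of the rules of thumb in this comment: the `Workload` class, the function names, and the handling of the "between zero and 5 seconds" case are my own assumptions, not anything a database vendor publishes.

```python
from dataclasses import dataclass

DAY_SECONDS = 24 * 3600


@dataclass
class Workload:
    """Hypothetical description of a storage problem (illustrative only)."""
    tolerable_loss_seconds: float  # how much recent data may vanish in a crash
    data_unit: str                 # "value", "record", or "document"
    data_size_gb: float
    server_ram_gb: float
    server_disk_gb: float


def durability_advice(w: Workload) -> str:
    if w.tolerable_loss_seconds == 0:
        return "conservative SQL database"
    if w.tolerable_loss_seconds >= DAY_SECONDS:
        return "pretty much anything"
    if w.tolerable_loss_seconds >= 5:
        return "MongoDB replication pair"
    # Assumption: the comment does not cover losses under 5 seconds,
    # so this sketch falls back to the conservative choice.
    return "conservative SQL database"


def data_model_advice(w: Workload) -> str:
    options = {
        "value": "Redis",
        "record": "SQL or a table-based store (Cassandra, HBase)",
        "document": "document store (MongoDB, CouchDB)",
    }
    return options[w.data_unit]


def topology_advice(w: Workload) -> str:
    if w.data_size_gb <= w.server_ram_gb:
        return "single server"
    if w.data_size_gb <= w.server_disk_gb:
        return "replication"
    return "sharding"


# Made-up example: analytics documents, 200 GB, one 64 GB RAM / 2 TB disk box.
example = Workload(5, "document", 200, 64, 2000)
print(durability_advice(example), "|", data_model_advice(example), "|", topology_advice(example))
```

The three answers are deliberately kept separate, since the questions are independent and can pull in different directions.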

Thanks for the response. I will mull that over next time I have a say in the matter of how to store data. I think at some point I will also just need to invest some time to play with a few of the NoSQL choices to get a better feel for them.
I fail to see why analytics data seem to be considered "low quality" data ("we don't care if some stuff occasionally gets lost"). As far as I can tell, most businesses out there are driven by metrics which are derived from analytics data... so I don't agree that "it's OK to lose some".
I use MySQL and Redis for persistence, depending on the type of data. Both get written to on a purchase: MySQL upgrades the account info, Redis holds my A/B testing stats which just had several tests score points. If the MySQL write fails, I have a CS emergency because my customer can't get what she paid for and I probably just ruined a lesson plan for tomorrow. If one of the Redis writes fails, my A/B test results that I won't look at for a week anyhow shift in a way that almost certainly doesn't alter my final decision.

It is absolutely OK to lose analytics data occasionally, and indeed with the variety of ways to bork that (js is off, user agent prefetches undisplayed page, bot action, etc) if your stats aren't robust against it you are screwed anyhow.
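
A minimal sketch of that split, assuming the two writes are wrapped in plain callables: `upgrade_account` and `record_ab_stats` are hypothetical stand-ins for the MySQL and Redis writes, not real driver calls. The critical write is allowed to fail loudly; the stats write is best effort.

```python
import logging

logger = logging.getLogger("purchase")


def record_purchase(upgrade_account, record_ab_stats, customer_id):
    """Run the critical write first, then the best-effort stats write.

    Returns True if the stats were recorded, False if they were dropped.
    Any exception from upgrade_account propagates to the caller.
    """
    # Critical: if this fails the customer did not get what she paid for,
    # so the error must not be swallowed.
    upgrade_account(customer_id)

    # Best effort: a lost A/B data point is acceptable, so log and move on.
    try:
        record_ab_stats(customer_id)
    except Exception:
        logger.warning("A/B stats write failed for customer %s", customer_id)
        return False
    return True
```

Whether to retry the stats write or just drop it is a judgment call; this sketch drops it.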

If they use it for statistical analysis then the sample size decreases a bit, but likely not in a significant way. If they lost all their data that would be a bit different.
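
To put a rough number on "not in a significant way": the standard error of a measured rate scales with one over the square root of the sample size. The figures below (a 5% conversion rate, 100,000 events, 1% of them lost) are made up for illustration, and the sketch assumes the lost events are a random subset rather than biased toward one outcome.

```python
import math


def standard_error(rate, n):
    """Standard error of a proportion estimated from n independent events."""
    return math.sqrt(rate * (1 - rate) / n)


full = standard_error(0.05, 100_000)   # nothing lost
lossy = standard_error(0.05, 99_000)   # 1% of events lost at random
ratio = lossy / full                   # equals sqrt(100000 / 99000)

print(f"standard error grows by {100 * (ratio - 1):.2f}%")
```

Under those assumptions, losing 1% of the events widens the error bars by about half a percent.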

For instance, I routinely delete web server logs older than 30 days, on the assumption that if I didn't need it in the last 30 days I'll probably never need it. Every now and then this bites me: I need more than 30 days of data to test some assumption (for events that occur infrequently enough), and then I just have to wait for a bit.
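
A 30-day cleanup like that could look roughly as follows. The `*.log` pattern, the function name, and the use of file modification time as the age are assumptions for the sake of the example; it defaults to a dry run so nothing is deleted by accident.

```python
import time
from pathlib import Path


def purge_old_logs(log_dir, max_age_days=30, now=None, dry_run=True):
    """Return the *.log files older than max_age_days; delete them unless dry_run."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 3600
    stale = [
        path
        for path in sorted(Path(log_dir).glob("*.log"))
        if path.is_file() and path.stat().st_mtime < cutoff
    ]
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Calling `purge_old_logs(some_dir)` only lists the candidates; pass `dry_run=False` to actually remove them.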

What jacquesm said. It depends on the kind of analytics data; our data is not as important as someone else's analytics data. For us it's more important that writes are of reasonable speed, that we can store the data in arbitrary structures in a schemaless way, that we can define arbitrary search indexes and that the data can be horizontally spread across multiple servers.

Website visit counters are a great example. I don't think many people care if a few visits get lost once in a while.

Unrelated comment: thanks a bunch for the speed optimisations pushed into the ruby mongo driver :)
some humor to validate your point :) http://www.xtranormal.com/watch/6995033/