Hacker News new | ask | show | jobs
by metamemetics 5672 days ago
I am investigating replacing MySQL with MongoDB in my model layer for my next prototype.

My mind is still thinking in 3NF though. I understand dernormalization and avoiding joins will be useful from a performance standpoint. However I am unsure when to either go ahead and include a foreign key, retrieve it and perform a second query from the application layer, and when to go ahead and duplicate\embed all the field data. I'm leaning towards just making 2 simple sequential key lookup queries, the 2nd on the retrieved foreign key, rather than duplicating fields everywhere and keeping track of massively cascading changes. Instead of performing 1 MySQL Join. Although I usually think in terms of minimizing roundtrips to the database server.

Wondering if anyone has a heuristic for this or suggested reading?

2 comments

Please ask yourself why. In reality, MongoDB has almost all of the same limitations of MySQL and PostgreSQL, but lacks the production proven record. In addition, MongoDB has very weak durability guarantees on a single server, poor performance for data that is not in-memory, and continues some common SQL/ACID scalability pitfalls (use of arbitrary indexes and ad-hoc queries).

Outside of this, you need to switch the question you ask from "what data do I need to capture?" to "what questions do I need to answer?"

>Please ask yourself why.

Looking for a way to hands-free scale very cheaply without vendor lockin. Would be nice if I can simply add another machine to the database to the cluster. And not have to generate the ids in the application layer, use hashing algo to select correct machine, and have everything stop working when a single database goes down. Seems like it should be a solved problem by now. Investigating new tech won't set me back much time, and my MySQL queries aren't going to disappear if I don't like it. Furthermore looking at large sites such as Flickr that massively scaled MySQL it seems like they stopped using its relational features anyway.

It's not as "hands-free" as you'd like to believe. Check out the MongoDB sharding introduction[1]. There are some pretty big caveats. Very few people are using auto-sharding at scale in production (bit.ly and BoxedIce are all I know of).

There are other operational issues with MongoDB. MongoDB can only do a repair if there is twice the available disk space as the database uses, and the server must be effectively brought offline to do this. To reclaim unused disk space, you have to do a, you guessed it, compact/repair. Want to do a backup? The accepted way to do this is to have a dedicated slave that can be write-locked for however long it takes to do your backup. They suggest using LVM snapshots to make this short, but disk performance on volumes with LVM snapshots is terrible.

I would consider using MongoDB for a setup that would either be either non-critical, completely within memory with bounded growth (which itself sort of begs the question...), or involve mostly write-once data, such as feeds, analytics, and comment systems.

[1] http://www.mongodb.org/display/DOCS/Sharding+Introduction

Well my platform is n number of $20 linodes to start. I'm clustering the python application across them using uwsgi+nginx (all I have to do is add an IP address in the config to scale), it's going to be a given that I shard the database across them as well. If you feel I should avoid Mongo would you recommend Cassandra instead?

Regardless, I think my initial question regarding when to denormalize data applies to any database including scaled MySQL, but perhaps was a better question for stackoverflow.

Cassandra has it's own hurdles, but I think if we're talking about getting your mind in the right place, it might be a better answer. Cassandra definitely has a much more mature scalability implementation that isn't caveat-ridden like MongoDB is. It's operating at scale at both Twitter and Facebook.

Cassandra has online compaction, but still requires up to 2x space for compaction. However, Cassandra does not have to do a full scan of the entire database to do compaction, and almost never actually uses the 2x space. It's also much easier to maintain a Cassandra cluster, because each instance shares equal responsibility, and replication topology is handled for you.

Despite what their fans will say, these are both beta-quality products.

Care to back this up? I think MongoDB is awesome from what I've seen so far and I've heard nothing but good thiings about it even in the performance sector. I'd love to see more info on what you're saying. Thanks.
MongoDB is designed specifically for multi-server durability. All writes are done in-place. It will support single-server durability in 1.8. Until then, 10gen strongly recommends against using a single-server setup with any data you care about[1].

I don't have any ready figures in front of me for disk performance, but MongoDB uses an in-memory layout and memory-mapped files, which provide a far less than optimum data layout for disk performance. As you might imagine, it works well with SSDs, but the performance is awful on rotational disks. Foursquare's outage was caused in part when their MongoDB instance began to exceed available RAM. My own experiences with larger-than-RAM MongoDB collections mirror this conclusion. Under these circumstances, you're likely to see much worse performance with MongoDB than MySQL or PostgreSQL, especially with concurrent writes[2].

As far as the SQL pitfalls, and database in general, don't think for a second MongoDB has some magic that exempts it from performance problems that RDBMS' have. Start using any of the familiar SQL scaling "no-no" features: multiple indexes, range queries, limits/offsets, etc.... and it's going to start exhibiting performance characteristics like a PostgreSQL or MySQL database would under those circumstances.

The temptation with MongoDB is to be lured into this idea that a brand new, immature database with convenient, SQL-like features can perform significantly better than it's highly-tuned RDBMS brethren. There are some reasons to choose MongoDB over MySQL and PostgreSQL, but performance should not be one of them.

[1] http://www.mongodb.org/display/DOCS/Durability+and+Repair#Du...

[2] http://jira.mongodb.org/browse/SERVER-574

All the same limitations? Most SQL databases don't support anywhere near the amount of atomic operations that MongoDB provides. Also sharding MongoDB is a lot easier than sharding MySQL and PostgreSQL.
I'm a little confused by this; how do SQL databases not support more atomic operations that MongoDB, seeing as you have full control of the transactions themselves? Incrementing, adding to a list (generally done with inserting a row in another table), etc., are all standard operations.
Try upserting (update if exist, insert if not exist) a row, that doesn't work race-condition-free even in SERIALIZABLE isolation.
We used mongodb for a weekend hack/prototype cycling news site. It was an excuse to play with Mongo, not because we felt we'd need Mongo for this particular site. I am very glad we did use it, because I now understand some of the cool things you can do with Mongo.

Basically, if you are prototyping and have a few extra hours to spend playing, I would say go for it. Can't hurt to understand the tool, so you can pick it if it makes sense for what you need to do down the road.