by ehwizard 5334 days ago
From CTO of 10gen

First, I tried to find any client of ours with a track record like this and have been unsuccessful. I personally have looked at every single customer case that’s ever come in (there are about 1600 of them) and cannot match this story to any of them. I am confused as to the origin here, so answers cannot be complete in some cases.

Some comments below, but the most important thing I wanted to say is if you have an issue with MongoDB please reach out so that we can help. https://groups.google.com/group/mongodb-user is the support forum, or try the IRC channel.

> 1. MongoDB issues writes in unsafe ways by default in order to win benchmarks

The reason for this has absolutely nothing to do with benchmarks, and everything to do with the original API design and what we were trying to do with it. To be fair, the uses of MongoDB have shifted a great deal since then, so perhaps the defaults could change.

The philosophy is to give the driver and the user fine-grained control over acknowledgement of write completions. Not all writes are created equal, and it makes sense to be able to check on writes in different ways. For example with replica sets, you can do things like “don’t acknowledge this write until it’s on nodes in at least 2 data centers.”
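To make the idea of per-write acknowledgement concrete, here is a minimal sketch in Python. It is a toy model, not MongoDB's driver API: the `Node` class, the `datacenter` tag, and the `write_acknowledged` function are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A replica set member; `datacenter` is a hypothetical tag."""
    name: str
    datacenter: str


def write_acknowledged(applied_on, min_nodes=1, min_datacenters=1):
    """Return True once the write has reached enough nodes and data centers.

    applied_on: iterable of Node objects that have applied the write.
    """
    nodes = set(applied_on)
    datacenters = {node.datacenter for node in nodes}
    return len(nodes) >= min_nodes and len(datacenters) >= min_datacenters


# Hypothetical three-member replica set spread over two data centers.
primary = Node("a", "nyc")
secondary_same_dc = Node("b", "nyc")
secondary_other_dc = Node("c", "sfo")

# Fire-and-forget style: nothing is required before the client moves on.
print(write_acknowledged([], min_nodes=0, min_datacenters=0))
# Two nodes have the write, but both sit in the same data center.
print(write_acknowledged([primary, secondary_same_dc], min_datacenters=2))
# The write has now reached a second data center.
print(write_acknowledged([primary, secondary_other_dc], min_datacenters=2))
```

The point of the sketch is only that the acknowledgement rule is a parameter of each write, so a cheap log line and a payment record can ask for different guarantees.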

> 2. MongoDB can lose data in many startling ways

> 1. They just disappeared sometimes. Cause unknown.

There has never been a case of a record disappearing that we have not been able to trace either to a bug that was fixed immediately or to other environmental issues. If you can link to a case number, we can at least try to understand or explain what happened. Clearly a case like this would be incredibly serious, and if this did happen to you, I hope you told us and that we were able to understand and fix it immediately.

> 2. Recovery on corrupt database was not successful, pre transaction log.

This is expected. Repair was generally meant for single servers, and running a single server without journaling is itself not recommended. If a secondary crashes without journaling, you should resync it from the primary. As an FYI, journaling is the default and almost always used in v2.0.

> 3. Replication between master and slave had gaps in the oplogs, causing slaves to be missing records the master had. Yes, there is no checksum, and yes, the replication status had the slaves current

Do you have the case number? I do not see a case where this happened, but if true, it would obviously be a critical bug.

> 4. Replication just stops sometimes, without error. Monitor your replication status!

If you mean that an error condition can occur without issuing errors to a client, then yes, this is possible. If you want verification that replication is working at write time, you can get it by passing the w=2 parameter to getLastError.
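As a rough illustration of why asking for two acknowledgements surfaces stalled replication, here is a toy Python model. It is not MongoDB's implementation and does not call any driver: `ToyReplicaSet` and its `write` method are hypothetical, and the `RuntimeError` merely stands in for the error a client would see when a w=2 check is not satisfied.

```python
class ToyReplicaSet:
    """Toy primary plus secondaries; not MongoDB's implementation."""

    def __init__(self, secondaries_replicating):
        # One boolean per secondary: True if it is still applying the oplog.
        self.secondaries_replicating = list(secondaries_replicating)

    def write(self, doc, w=1):
        """Apply doc on the primary, then count the nodes holding it.

        Raises RuntimeError if fewer than w nodes have the write.
        """
        acked = 1 + sum(self.secondaries_replicating)  # primary always counts
        if acked < w:
            raise RuntimeError(f"write reached {acked} node(s), wanted {w}")
        return acked


stalled = ToyReplicaSet([False, False])  # replication has silently stopped

# With w=1 the client sees success and never learns about the stall.
print(stalled.write({"_id": 1}, w=1))

# With w=2 the same write reports the problem at write time.
try:
    stalled.write({"_id": 2}, w=2)
except RuntimeError as exc:
    print("error:", exc)
```

The trade-off is latency: every write that asks for w=2 waits on at least one secondary.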

> 3. MongoDB requires a global write lock to issue any write

> Under a write-heavy load, this will kill you. If you run a blog, you maybe don't care b/c your R:W ratio is so high.

The read/write lock is definitely an issue, but a lot of progress has been made and more is to come. 2.0 introduced better yielding, reducing the scenarios where locks are held through slow IO operations. 2.2 will continue the yielding improvements and introduce finer-grained concurrency.

> 4. MongoDB's sharding doesn't work that well under load

> Adding a shard under heavy load is a nightmare. Mongo either moves chunks between shards so quickly it DOSes the production traffic, or refuses to move chunks altogether.

Once a system is at or exceeding its capacity, moving data off is of course going to be hard. I talk about this in every single presentation I’ve ever given about sharding[0]: do not wait too long to add capacity. If you try to add capacity to a system at 100% utilization, it is not going to work.
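A back-of-the-envelope sketch in Python shows why headroom matters. This is a deliberately simplified model with made-up numbers (100 GB to move, 200 MB/s of total I/O capacity); `migration_hours` is a hypothetical helper, and real chunk migration depends on far more than spare throughput.

```python
import math


def migration_hours(data_gb, capacity_mb_per_s, utilization):
    """Hours to move data_gb using only spare I/O capacity.

    utilization is the fraction of capacity consumed by production traffic.
    """
    spare_mb_per_s = capacity_mb_per_s * (1.0 - utilization)
    if spare_mb_per_s <= 0:
        return math.inf  # no headroom: the migration never finishes
    return data_gb * 1024 / spare_mb_per_s / 3600


# Hypothetical shard: 100 GB to move off, 200 MB/s of total capacity.
for utilization in (0.5, 0.9, 1.0):
    print(utilization, migration_hours(100, 200, utilization))
```

In this model the time to rebalance grows without bound as utilization approaches 100%, which is the reason to add shards while there is still spare capacity.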

> 5. mongos is unreliable

> The mongod/config server/mongos architecture is actually pretty reasonable and clever. Unfortunately, mongos is complete garbage. Under load, it crashed anywhere from every few hours to every few days. Restart supervision didn't always help b/c sometimes it would throw some assertion that would bail out a critical thread, but the process would stay running. Double fail.

I know of no such critical thread, can you send more details?

> 6. MongoDB actually once deleted the entire dataset

> MongoDB, 1.6, in replica set configuration, would sometimes determine the wrong node (often an empty node) was the freshest copy of the data available. It would then DELETE ALL THE DATA ON THE REPLICA (which may have been the 700GB of good data)

> They fixed this in 1.8, thank god.

Cannot find any relevant client issue, case, or commit. Can you please send something that we can look at?

> 7. Things were shipped that should have never been shipped

> Things with known, embarrassing bugs that could cause data problems were in "stable" releases--and often we weren't told about these issues until after they bit us, and then only b/c we had a super duper crazy platinum support contract with 10gen.

There is no crazy platinum contract and every issue we ever find is put into the public jira. Every fix we make is public. Fixes have cases which are public. Without specifics, this is incredibly hard to discuss. When we do fix bugs we will try to get to users as fast as possible.

> 8. Replication was lackluster on busy servers

This simply sounds like a case of an overloaded server. I mentioned this before, but if you want guaranteed replication, use the w=2 form of getLastError.

> But, the real problem:

> 1. Don't lose data, be very deterministic with data

> 2. Employ practices to stay available

> 3. Multi-node scalability

> 4. Minimize latency at 99% and 95%

> 5. Raw req/s per resource

> 10gen's order seems to be, #5, then everything else in some order. #1 ain't in the top 3.

This is simply not true. Look at commits, look at what fixes we have made when. We have never shipped a release with a secret bug or anything remotely close to that and then secretly told certain clients. To be honest, if we were focused on raw req/s we would fix some of the code paths that waste a ton of cpu cycles. If we really cared about benchmark performance over anything else we would have dealt with the locking issues earlier so multi-threaded benchmarks would be better. (Even the most naive user benchmarks are usually multi-threaded.)

MongoDB is still a new product, there are definitely rough edges, and a seemingly infinite list of things to do.[1]

If you want to come talk to the MongoDB team, both our offices hold open office hours[2] where you can come and talk to the actual development teams. We try to be incredibly open, so please come and get to know us.

-Eliot

[0] http://www.10gen.com/presentations#speaker__eliot_horowitz
[1] http://jira.mongodb.org/
[2] http://www.10gen.com/office-hours

7 comments

One addendum to Eliot's "both our offices hold open office hours"; we (10gen) also recently opened an office in London.

Although we don't yet have a fixed office hours schedule, we typically hold them every 2 weeks. The exact dates are announced via the local MongoDB Meetup Group°; we always hold the hours at "Look Mum No Hands" on Old Street.

At least one (and often several) of our Engineers make themselves available during this time to answer any questions and assist with MongoDB problems.

° http://www.meetup.com/London-MongoDB-User-Group

Great response. I'll take this over an anonymous, half-informed screed any day.
We've been using Mongo for almost a year now, and we've not seen any of the major issues such as data loss referred to. We've seen some of the growing pains of a quickly moving, dynamic platform, but nothing outside of the realm of what is reasonable for such a powerful solution. It's true that implementing sharding is no simple task, but with enough planning up front, you'll find yourself able to scale horizontally very quickly. After a couple of weeks of planning, we wound up making a few small changes in our codebase to migrate from master/slave to a sharded environment. Not a huge undertaking by any stretch, provided the current flexibility of our platform. Also, due to the fact that 10gen does make all bug information publicly available, we've managed to get it done with zero surprises.

Wedge Martin CTO Badgeville

Eliot, thanks for coming online and publishing your perspective.

MongoDB simply gets better with every version, and it is indeed a reliable platform, at least as reliable as human beings (employees) are.

> If you want to come talk to the MongoDB team, both our offices hold open office hours[2] where you can come and talk to the actual development teams. We try to be incredibly open, so please come and get to know us.

I envy how all your (potential) customers are from California.

I've been to their open office hours in NYC and, though we don't have a support contract, they were incredibly welcoming and helpful.
Besides office hours in California, NY, and London, we also have user groups in many cities (http://www.10gen.com/user-groups) and frequently hold one-day, very inexpensive developer conferences (the next two are in Dallas and Seattle).
We try to get as much face to face time with the community as possible. Check out 10gen.com/events and 10gen.com/user-groups.
Half the startups in NYC use mongo, but that might be cause they are connected to Union Sq Ventures
Or it might be because MongoDB really shines in the typical start-up use case...
Or at least better than MySQL for cases where not all data fits a perfect relational model?
Given the response, what are some best practices/gotchas for MongoDB then?

It might be helpful for 10gen to put together a short doc on what evaluators should watch out for.

Most of the best practices/gotchas can be found by reading the online documentation. The replies Eliot gives were all either plainly obvious (oh, you have a system under heavy load and you're surprised that it gets worse when you give it another task to do?) or already covered in the documentation. If you're planning on using something - especially for a production system - I sure hope you at least read all the available documentation.

I don't think a short doc is of any help for evaluators. You shouldn't be basing your decision on 400 words and some bullet points. If you're serious about your datastore then you should treat it seriously.

In addition to the documentation, videos from the conferences are a great place to start:

http://www.10gen.com/presentations/mongoboston-2011/schema-d... http://www.10gen.com/presentations/mongosf-2011/practical-sc...

When I was doing my research and came across a bunch of "Why not to use MongoDB" articles, I looked at alternative solutions to see if there was anything "better." Granted, NoSQL is the new kid on the block, but I wanted to see what my options were. Guess what I'm using: MongoDB. Why? Their documentation is fan-f'n-tastic. Their newsgroup support is just as good, with lots of folks who help troubleshoot issues, including the developers themselves.
I was even gonna write a big blog post and say something similar to what you just said, but (of course) you said it better. Thank you.