Hacker News
by josephg 3426 days ago
Right out of the box? Mongodb has been trying to get it right for 10 years now. Kyle says the storage engine they've used for most of that lifetime is fundamentally flawed, and they've only now, a decade on, managed to write something without known bugs to replace it. And maybe this time it's ok. Maybe this time there aren't any more layers of buggy crap in mongo yet to be found and fixed.

Maybe. But you'd have lost that bet if you made it any day in the last 10 years. And in those 10 years mongodb has demonstrated again and again that they aren't up to the task of writing a reliable database. Even with their new storage engine they couldn't find the bugs alone.

I think using mongo today for any mission critical data is an irresponsible choice. I'd seriously question the judgement of any senior engineer who picks it for a new project over rethinkdb or Postgres.

>"Kyle says the storage engine they've used for most of that lifetime is fundamentally flawed, and they've only now, a decade on, managed to write something without known bugs to replace it"

Didn't WiredTiger Inc write the new WiredTiger storage engine before they were acquired by MongoDB Inc?

https://gigaom.com/2014/12/16/mongodb-snaps-up-wiredtiger-as...

Do you think MongoDB is a good choice (given how easy it is to use) when you only care that 99.999% of the data you insert ends up in the database? That's my use case. Best-effort integrity. I mostly just want a DB that can insert and query documents fast, and I'm not really concerned if I lose a few documents here and there.

Why wouldn't you just use anything else that can manage to insert/read data without losing it?

I don't really understand the angle of "can I get away with it anyways, tho?"

Some of us are already using MongoDB and are not so keen on replacing it.

If you read back, the discussion was scoped to "new projects". By josephg:

> I'd seriously question the judgement of any senior engineer who picks it for a new project over rethinkdb or Postgres.

It's about making tradeoffs. If MongoDB works for you (I actually enjoy using it tremendously), then you have to ask yourself whether you are OK with its non-perfect integrity. For my use cases this isn't a problem. I'm not working with customer data or anything where losing a few records would make any difference at all.

In my experience, mongo lets you check the end result and try inserting again.

How do you expect to check the end result? The article's Jepsen analysis shows that both the v0 and v1 replication protocols (excepting the very latest version of v1, which appears to be a response to this) can result in acknowledged writes being lost. I.e., the DB tells you, for a write sent with a majority, that the write was successful, to a majority! Subsequently (and, if I understand the article, possibly not immediately), the write can be lost.

It depends.

Given a small cluster of reliable nodes on a reliable network, these errors will occur extremely rarely. So rarely, in fact, that they'll be written off as "user error" by support.

If you're a startup building a system which has to quickly and reliably scale from 3 to 3000 nodes in a year, then the whole thing is likely to explode in your face. Twitter style.

Now, if MongoDB were so superior that it was truly the platform that would even enable that kind of scaling, then the decision would be simple: just go for it.

The thing is, this isn't how the world works. When systems are built, very few people consider (or are capable of considering) the growth of the system. Frameworks and databases are, as a rule, chosen arbitrarily. When scaling happens, the question is more "how can we scale what we have whilst having everything kind of work" than "how do we design a system which works correctly at scale".

Mongo's whole strategy is based around this. Make Mongo the default choice for the current generation of developers.

Fantastic market strategy.

Fantastic market strategy, but it's still snake oil they're selling.

When you talk about growing, the biggest value in Open Source has been that you can start with something free but shit, and then, as you make money, you can spend it on customizing that Open Source in a way that benefits you.

However there exist commercial offerings that are (and were) faster and better at MongoDB than MongoDB was: KDB could've handled Twitter, we never would've seen a fail whale, and it is a whole hell of a lot cheaper than the developers and the customizers, and the headache, and the fact that you're making something open source which ultimately benefits your competition.

Another way to think about it is by thinking about experts: If you've got a great startup idea, why would you want to make your odds 10% worse by introducing the possibility it'll fail, by using the cheapest hacky hack thing that has 10% chance of losing your data? Ask experts with data, and be honest with your budget and you'll do a lot better.

I have some actual experience with KDB and MongoDB so I'm going to have to call bullshit.

How does KDB handle replication and failover? Or even high insert/update rates to datasets that exceed the size of memory? How do you shard KDB?

KDB doesn't support Unicode text. Do you plan to only have English-speaking users?

Yes, KDB excels at its relatively well defined niche of transforming and aggregating "smallish" (say 10 TB or less) numerical time series data. It would be a horrible choice for the backing store of a high throughput CRUD application...

What is it with KDB zealots thinking that KDB is the best database for every task? I swear, KDB is the Scientology of databases.

Well, thanks for the question!

You check the result with getLastError which, as you described, can be used to ensure a majority agrees with the write. But you normally don't use getLastError that way, because a majority might not even be concerned with that particular write. They are, after all, shards. Instead you check if the primary got the write. If the primary disconnects while you are checking, you catch the exception and keep checking until a new primary is decided. And if your check result is not ok, you try inserting again. That's as reliable as it gets when inserting to any database, including SQL databases that support transactions.
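
For anyone trying to picture the check-and-retry loop described in this comment, here is a minimal sketch in Python. It is only an illustration under stated assumptions: FakeCollection, PrimaryUnavailable and insert_with_check are hypothetical names invented for this example, the "database" is an in-memory dict, and none of it is the real MongoDB driver API. It also assumes the caller supplies a stable _id so that a retried insert is idempotent.

```python
class PrimaryUnavailable(Exception):
    """Hypothetical error: no primary is reachable right now."""


class FakeCollection:
    """In-memory stand-in for a collection (NOT a real MongoDB driver API).

    The first `failures_before_success` operations raise PrimaryUnavailable,
    which simulates a primary stepping down while we insert or check.
    """

    def __init__(self, failures_before_success=0):
        self.docs = {}
        self._failures = failures_before_success

    def _maybe_fail(self):
        if self._failures > 0:
            self._failures -= 1
            raise PrimaryUnavailable("primary stepped down")

    def insert(self, doc):
        self._maybe_fail()
        # Keyed on _id, so inserting the same document twice is harmless.
        self.docs[doc["_id"]] = doc

    def exists(self, doc_id):
        self._maybe_fail()
        return doc_id in self.docs


def insert_with_check(coll, doc, max_attempts=5):
    """Insert `doc`, confirm the (simulated) primary has it, retry otherwise.

    Returns the number of attempts used; raises RuntimeError if the write
    could not be confirmed within `max_attempts`.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            coll.insert(doc)
            if coll.exists(doc["_id"]):
                return attempt
        except PrimaryUnavailable:
            # Primary went away mid-operation: try again on the next loop.
            continue
    raise RuntimeError("write not confirmed after %d attempts" % max_attempts)
```

Note what this sketch does not cover: it only confirms that the primary saw the write at the moment of the check. It says nothing about the case raised earlier in the thread, where a write that was already acknowledged is lost afterwards.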

You describe it like it is simple, but that is a ridiculous number of steps to simply check that your data was actually written to the database.

>that's as reliable as it gets when inserting into any database including SQL

The difference being that in a SQL database you call commit and all this happens for you automatically.

>>You describe it like it is simple

ah, no. I did not.

> I'd seriously question the judgement of any senior engineer who picks it for a new project over rethinkdb or Postgres.

... you mean RethinkDB, whose future is still uncertain? Regardless of technical merits, the currently unstable future of RethinkDB means a senior engineer should be extremely cautious about selecting it for a significant project.

To be fair, choosing a scalable database even for a senior engineer still requires quite a bit of very specialized knowledge in distributed systems that most simply don't have. So they have to rely on what "feels right" anyway, rather than making an engineering decision, and are very susceptible to all the marketing and PR and authoritative opinions. There are no right choices for them. Although if in doubt everyone should probably default to a dynamo-style db, as it forces you to think about and organize your data in a certain future-proof way, which actually excludes all of the mentioned databases.