by rogerbinns 4515 days ago
There is one thing MongoDB does spectacularly well - you can feed in arbitrary JSON and get the same JSON back out. (No need to define schemas or play any kind of system/db administrator.) Even the queries have the same "shape" as JSON, so no need for another arbitrary query language.
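
To make the "same shape" point concrete, here is a rough sketch in plain Python, with no database involved. The `matches` helper and the song records are hypothetical, and the matching rules are deliberately simplified rather than MongoDB's actual semantics. The only point is that the query is itself a JSON-like document.

```python
# Illustrative only: a tiny in-memory matcher, NOT MongoDB's implementation.
# `matches` and the sample records are hypothetical names for this sketch.

def matches(document, query):
    """Return True if every key in `query` has an equal value in `document`.

    Nested dicts in the query are compared recursively against nested
    dicts in the document (a simplification of real query semantics).
    """
    for key, expected in query.items():
        actual = document.get(key)
        if isinstance(expected, dict) and isinstance(actual, dict):
            if not matches(actual, expected):
                return False
        elif actual != expected:
            return False
    return True


# Made-up records standing in for a collection of stored JSON documents.
songs = [
    {"title": "Song A", "artist": {"name": "Artist X"}, "year": 1999},
    {"title": "Song B", "artist": {"name": "Artist X"}, "year": 1999},
    {"title": "Song C", "artist": {"name": "Artist Y"}, "year": 2001},
]

# The query is just another JSON-shaped document.
query = {"artist": {"name": "Artist X"}, "year": 1999}

hits = [song["title"] for song in songs if matches(song, query)]
print(hits)
```

The stored records and the query are both ordinary dicts, which is why no separate query language is needed for simple lookups.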

It will eventually bite you, and bite you hard. But you'll be well into the millions of records before that happens. Developers and products below that number will have very smooth sailing. And some live there permanently. One project I worked on years ago involved a music catalogue. Did you know there are only about 20 million songs?

The main problem is that things get very painful as you get bigger, especially for writes. A doubling of write activity can lead to calamitous drops in performance. This is especially bizarre as the data model means they could easily have multiple concurrent writers. Heck, having a lock per 2GB data file would quickly help with concurrency.

They have this same "single" approach in other places. For example building an index is single threaded. I did a restore the other day and then had to wait 8 days while it rebuilt indexes. One cpu was pegged but everything else was idle!

It also consumes huge amounts of space - at least double the size of the same data in JSON. There are known fixes: https://jira.mongodb.org/browse/SERVER-164 and https://jira.mongodb.org/browse/SERVER-863 (note how popular they are and how many years they have been open!)

I wish they would focus on making better use of the resources available - it should be possible to max out cpu, RAM and I/O.

We've ended up in the same situation as the article, figuring out where to migrate to with Cassandra being the front runner.

2 comments

It's telling that those jira tickets are both in the top five most voted on open server tickets and are from 2010 and before. (I know you know this roger, but) TokuMX completely resolves both of them.
But does TokuMX solve all the other issues, like write concurrency, eye-wateringly slow (and single-threaded) index building, or there being no practical way to reduce the file sizes? (repairDB takes forever and requires as much free space as is already used, and compact also needs free space and doesn't remove any files.)
(I work for Tokutek)

Write concurrency: yes, TokuMX does not have a database level reader/writer lock.

Index Building: yes, fractal trees can write data much more efficiently, so if index building is a problem, I bet TokuMX solves it.

Practically reducing file size: to be honest, I am not sure. Thanks to our great compression, this has not been a general issue for our users. Our reindex command could reduce file size, but I cannot point to examples.

One of our big goals is to address storage issues MongoDB has.

I don't know much about TokuMX besides it being MVCC, which would solve the write concurrency issue nicely.
Yes. It does bite you quickly. As you add more models you start to duplicate a lot more information, and by that point you'd think relational makes sense, but you have to keep using MongoDB. Your options are to either embed or reference. And still, there is no JOIN in Mongo, so you end up querying several collections and combining the results in your application code.
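
For anyone who hasn't had to do it, combining in application code looks roughly like the sketch below. It uses plain Python lists in place of real collections. The `authors` and `posts` data and the `join_posts_with_authors` helper are all made up for illustration.

```python
# Illustrative only: plain lists of dicts standing in for two collections
# that use the "reference" approach (posts point at authors by id).
authors = [
    {"_id": 1, "name": "Ann"},
    {"_id": 2, "name": "Bob"},
]
posts = [
    {"_id": 10, "author_id": 1, "title": "Hello"},
    {"_id": 11, "author_id": 2, "title": "World"},
    {"_id": 12, "author_id": 3, "title": "Orphan"},  # dangling reference
]


def join_posts_with_authors(posts, authors):
    """Attach each post's author name by looking it up in a dict index.

    This is the manual equivalent of a LEFT JOIN: posts whose author
    cannot be found are kept, with the author set to None.
    """
    authors_by_id = {author["_id"]: author for author in authors}
    joined = []
    for post in posts:
        author = authors_by_id.get(post["author_id"])
        joined.append({
            "title": post["title"],
            "author": author["name"] if author else None,
        })
    return joined


joined = join_posts_with_authors(posts, authors)
for row in joined:
    print(row)
```

Note the dangling reference in the last post: nothing stops it from existing, which is exactly the kind of thing a foreign key constraint would have caught.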

I think as PostgreSQL continues to improve its JSON data type, people will look at SQL again, even if they just need to get a basic model working, because in the end working with constraints can help. Well, either side will bite you, but one has to weigh... and sure, that's a tough question.