by dccoolgai 3601 days ago
I agree... but at this point, it's tough to see with all the ink that's been spilled on these issues for years how you could think anything else. Maybe I read HN too much, but the manifold problems with MongoDB have been widely publicized for the past 6 years... it seems pretty close to conventional wisdom that you're going to have those problems if you decide to use Mongo.

You have to distinguish between "Mongo the database" and "Mongo as it's used by companies"; the latter causes far more problems than the former.

I last used MongoDB seriously in 2012-2015. We had myriad operations problems including inconsistent indexing across shards (where some shards had an index created and others didn't, it was baffling), issues with the balancer not moving chunks properly, and more. Also it's just different than other DBs with its lack of transactional consistency (I think they've made progress on building this), but that's part of why it's fast.

However, the bigger problem is that document databases -- in general -- enable a kind of software development where the model sort of emerges over time, rather than being carefully designed from the beginning. Yes, it's flexible, but you pay an absolutely enormous cost down the line dealing with inconsistent documents. It's not like code where if you do something stupid, you can fix it over time with refactoring and "remodeling" -- data has mass. You can get into a situation where, with a large data set, it can take a week or more just to run the migration script required to scan an entire collection and rewrite a few billion documents into a new, better format.
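
To make the migration cost concrete, here's a rough sketch of what such a script tends to look like. It is not tied to any real driver: the collection is just an in-memory dict, and the field names (`fullName`, `name`, `full_name`, `schema_version`) are made up for illustration. The point is that every old document has to be read, reshaped, and written back, which is why the run time scales with the size of the collection.

```python
from typing import Iterable, Iterator, List

TARGET_VERSION = 2  # hypothetical version marker for the "new, better format"


def normalize(doc: dict) -> dict:
    """Rewrite one legacy document into the new shape (field names are assumptions)."""
    out = dict(doc)
    # Older documents stored the name under either "fullName" or "name".
    camel = out.pop("fullName", None)
    plain = out.pop("name", None)
    if "full_name" not in out:
        out["full_name"] = camel if camel is not None else plain
    out["schema_version"] = TARGET_VERSION
    return out


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield documents in fixed-size batches so the work can be resumed or throttled."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def migrate(collection: dict, batch_size: int = 2) -> int:
    """Rewrite every document not yet at TARGET_VERSION; return how many were rewritten."""
    pending = [d for d in collection.values()
               if d.get("schema_version") != TARGET_VERSION]
    rewritten = 0
    for batch in batches(pending, batch_size):
        for doc in batch:
            collection[doc["_id"]] = normalize(doc)
            rewritten += 1
    return rewritten
```

Because already-migrated documents are skipped, running it a second time is a no-op, which matters when a week-long job dies halfway through.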

There is no such thing as a "schemaless" database. That's like saying, oh sure, we just have a bunch of 1s and 0s in memory -- our data is "structureless". The question is whether the database enforces the schema, or not. And I think that in a lot of cases, it's a lot worse to have an "uncodified schema" than a rigid, but at least well-defined, one, that's consistent across the data at all points.
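
If the database won't enforce the schema, the application ends up doing it. A minimal sketch of what "codifying" the schema might look like on the application side, assuming a made-up three-field schema (the field names and types here are purely illustrative, not from any real system):

```python
from typing import Dict, List

# Hypothetical schema: field name -> expected Python type.
SCHEMA: Dict[str, type] = {"_id": int, "email": str, "age": int}


def violations(doc: dict, schema: Dict[str, type] = SCHEMA) -> List[str]:
    """Return a list of ways `doc` deviates from `schema` (empty list means it conforms)."""
    problems: List[str] = []
    for field, expected in schema.items():
        if field not in doc:
            problems.append(f"missing {field}")
        elif not isinstance(doc[field], expected):
            actual = type(doc[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Fields nobody declared are usually the first sign of schema drift.
    for field in sorted(doc.keys() - schema.keys()):
        problems.append(f"unexpected {field}")
    return problems
```

Running something like this over a sample of an existing collection is a cheap way to find out how "uncodified" the schema has actually become.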

Sidenote: It's also occurred to me over the past few years that it's almost impossible to impose a consistent schema on a large enough dataset. If you truly are dealing with "big data" (TB/PB scale), maybe go straight to a document store or a columnar one, because doing a migration is outright impossible, but don't be so quick to write it off for GB-scale datasets.

I should point out that for many organizations that have "big data" sitting somewhere, it usually is structured to begin with because it was collected by a repeatable process; or at the very least each piece of the whole (if it is a collection of stuff from different corners) has its own internal consistency.

A challenge there is determining whether it makes sense to massage the data into a common schema for further analysis or to use an unstructured initial approach from the beginning. Sometimes you get to the former from the latter.

>>> Sidenote: It's also occurred to me over the past few years that it's almost impossible to impose a consistent schema on a large enough dataset

+1. I agree that NoSQL promotes "careless db design" to an extent.

But yes (as an example) situations do arise where you either need to alter the existing SQL table with bazillions of rows of data in it or refactor your design in an ugly-ass way by adding a 2-column table tied to the 1st table by some FK.

I would say the latter and the former are inextricably intertwined; i.e. Mongo causes problems for companies because it does things you would not expect a database to do, like your example of phantom indexes... and pretty much all bets being off when you try to leverage sharding... the one thing it was supposed to be able to do to scale...
Couldn't agree more.

We had more problems with sharding over the years than you could imagine...the distributed locking mechanism didn't work a lot of the time, the balancer didn't work, weird consistency issues between the config servers, configuration that didn't get replicated across all shards, stupid shard key selection (admittedly our fault but there really should be better guidance on this topic), etc.

I agree with your points about typical database usage. Could it be that some of the people who reach for a database don't in fact know what a particular database implementation (like Mongo, or MySQL, or Redis) actually does, and they're just looking for a black-box that holds data at rest and occasionally gives it back out?
Yeah, maybe. But there's a bigger point here and that's that software engineering is starting to become more of a "there's a way we do things" craft profession (like medicine, law, or architecture) vs. an "everything is from first principles all the time" endeavor. There's just too much stuff to know. You can't expect people to have deep experience with more than, say, 2-3 databases every decade.

So we have to rely to some extent on the experience of others and our own intuitions / less than perfect inferences.

Because Mongo is wrongly marketed as an "all-in-one" database.

Instead it should be marketed as "a good database for dynamic data that requires complex filtering, and we also have very good drivers for many languages" because that's the only thing it's good at.

But there are other databases that are good at that, too, and don't drop your data. RethinkDB? RavenDB? Depending on the type of data you're using either of those could be an option, and won't just drop your data randomly.
Yeah, drivers are not as good tho. That also counts, am I wrong?
I've had nothing but positive experience with RethinkDB drivers. RavenDB has good drivers for some languages, but admittedly not every language is well-supported.

Still, I can't see a situation where I would choose a non-working datastore over a working datastore.

I'm having an excellent experience coding against RethinkDB in Tornado Python. A shift in programming style from conventional callbacks to Tornado's gen.coroutine is necessary. Having made that shift, it's become very easy to use coroutines in the server to stream RethinkDB's JSON result sets up to an AngularJS front end. End-to-end JSON makes for zero impedance and rapid development. I'm constantly rejigging my schema, indexes and joins as I go, so it feels like RDBMS-based dev in a lot of ways.
The RethinkDB drivers are phenomenal in 4 to 5 languages.
Yeah. I remember thinking a few years ago that it tries to be three totally different products at the same time: a straight NoSQL k-v store like S3, a SQL-like OLTP relational thing ("rows are sorta like documents, right") and an offline analytics store with the map/reduce functionality.

It's like they wanted to be all three but couldn't quite commit to one, and the end result is the proverbial "tankicopter" that's neither as strong as a tank, nor as maneuverable as a helicopter.

I think it also might be okay as storage for derived data, but then again, why not shove it into a `uuid-jsonb` Postgres table, perhaps on a second server.
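
For anyone who hasn't seen the pattern, it's just a two-column table: a key and a JSON blob. Here's a rough sketch using SQLite from the Python standard library as a stand-in so it runs anywhere; in Postgres the columns would be `uuid` and `jsonb` instead of `TEXT`. The table name `derived` and the helper names are made up for illustration.

```python
import json
import sqlite3
import uuid
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the two-column store. SQLite stands in for Postgres here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS derived ("
        " id TEXT PRIMARY KEY,"   # would be `uuid` in Postgres
        " body TEXT NOT NULL)"    # would be `jsonb` in Postgres
    )
    return conn


def put(conn: sqlite3.Connection, doc: dict, doc_id: Optional[str] = None) -> str:
    """Insert or overwrite a document; generate a UUID key if none is given."""
    doc_id = doc_id or str(uuid.uuid4())
    conn.execute(
        "INSERT OR REPLACE INTO derived (id, body) VALUES (?, ?)",
        (doc_id, json.dumps(doc)),
    )
    conn.commit()
    return doc_id


def get(conn: sqlite3.Connection, doc_id: str) -> Optional[dict]:
    """Fetch a document by key, or None if it does not exist."""
    row = conn.execute(
        "SELECT body FROM derived WHERE id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Since the data is derived, losing or rebuilding the table is cheap, and you keep the option of adding real columns later if a field turns out to matter.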
AFAIK quite a few businesses are stuck with it, as the cost of migrating to anything else would be prohibitive. It's usually easy to migrate logic from one language to another; it's harder to change a persistence layer, especially when it involves a paradigm shift such as NoSQL to SQL.