by eldavido 3601 days ago
You have to distinguish between "Mongo the database" and "Mongo as it's used by companies"; the latter causes far more problems than the former.

I last used MongoDB seriously in 2012-2015. We had myriad operations problems including inconsistent indexing across shards (where some shards had an index created and others didn't, it was baffling), issues with the balancer not moving chunks properly, and more. Also it's just different than other DBs with its lack of transactional consistency (I think they've made progress on building this), but that's part of why it's fast.

However, the bigger problem is that document databases -- in general -- enable a kind of software development where the model sort of emerges over time, rather than being carefully designed from the beginning. Yes, it's flexible, but you pay an absolutely enormous cost down the line dealing with inconsistent documents. It's not like code where if you do something stupid, you can fix it over time with refactoring and "remodeling" -- data has mass. You can get into a situation where, with a large data set, it can take a week or more just to run the migration script required to scan an entire collection and rewrite a few billion documents into a new, better format.
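To put a rough number on "data has mass", here is a back-of-envelope sketch. The document count and the sustained rewrite throughput are assumptions chosen for illustration, not measurements from any real cluster.

```python
SECONDS_PER_DAY = 86_400

def migration_days(doc_count: int, docs_per_second: float) -> float:
    """Estimate wall-clock days to scan and rewrite every document once."""
    return doc_count / docs_per_second / SECONDS_PER_DAY

# Assumed figures: 3 billion documents, 5,000 rewrites/second sustained.
print(round(migration_days(3_000_000_000, 5_000), 1))  # 6.9
```

At that assumed rate a single pass takes about a week, and that is before any retries, throttling to protect production traffic, or a second pass to fix what the first one missed.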

There is no such thing as a "schemaless" database. That's like saying, oh sure, we just have a bunch of 1s and 0s in memory -- our data is "structureless". The question is whether the database enforces the schema, or not. And I think that in a lot of cases, it's a lot worse to have an "uncodified schema" than a rigid, but at least well-defined, one, that's consistent across the data at all points.
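As a minimal sketch of what "codifying" the schema can look like when the database does not enforce it: the check below lives in application code and reports how a document deviates from the expected shape. `USER_SCHEMA`, its field names, and `schema_violations` are hypothetical, made up for this example.

```python
from typing import Any

# Hypothetical schema: field name -> expected Python type.
USER_SCHEMA = {"email": str, "created_at": str, "age": int}

def schema_violations(doc: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of human-readable ways `doc` deviates from `schema`."""
    problems = []
    for field, expected in schema.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            actual = type(doc[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in doc:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems

good = {"email": "a@example.com", "created_at": "2015-01-01", "age": 30}
drifted = {"email": "a@example.com", "age": "30", "nick": "x"}
print(schema_violations(good, USER_SCHEMA))     # []
print(schema_violations(drifted, USER_SCHEMA))
```

The schema exists either way. The only question is whether something rejects `drifted` at write time or whether every reader has to cope with it later.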

Sidenote: It's also occurred to me over the past few years that it's almost impossible to impose a consistent schema on a large enough dataset. If you truly are dealing with "big data" (TB/PB scale), maybe go straight to the document store or columnar because doing a migration is outright impossible, but don't be so quick to write it off for GB-scale datasets.

4 comments

I should point out that for many organizations that have "big data" sitting somewhere, it usually is structured to begin with because it was collected by a repeatable process; or at the very least each piece of the whole (if it is a collection of stuff from different corners) has its own internal consistency.

A challenge there is determining whether it makes sense to massage the data into a common schema for further analysis or to use an unstructured initial approach from the beginning. Sometimes you get to the former from the latter.

>>> Sidenote: It's also occurred to me over the past few years that it's almost impossible to impose a consistent schema on a large enough dataset

+1. I agree that NoSQL promotes "careless db design" to an extent.

But yes (as an example) situations do arise where you either need to alter the existing SQL table with bazillions of rows of data in it or refactor your design in an ugly-ass way by adding a 2-column table tied to the 1st table by some FK.

I would say the latter and the former are inextricably intertwined; i.e. Mongo causes problems for companies because it does things you would not expect a database to do, like your example of phantom indexes...and pretty much all bets being off when you try to leverage sharding... the one thing it was supposed to be able to do to scale...
Couldn't agree more.

We had more problems with sharding over the years than you could imagine...the distributed locking mechanism didn't work a lot of the time, the balancer didn't work, weird consistency issues between the config servers, configuration that didn't get replicated across all shards, stupid shard key selection (admittedly our fault but there really should be better guidance on this topic), etc.
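On shard key selection, here is a toy model of one common pitfall: a monotonically increasing key under range-based partitioning sends all new writes to one shard, while hashing the key spreads them out. This is a simplified sketch, not MongoDB's actual chunking or hashing logic; the shard count, key range, and function names are all assumptions.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4        # assumed cluster size
MAX_KEY = 1_000_000   # assumed upper bound of the key space

def range_shard(key: int) -> int:
    """Range partitioning: each shard owns one contiguous slice of keys."""
    return min(key * NUM_SHARDS // MAX_KEY, NUM_SHARDS - 1)

def hashed_shard(key: int) -> int:
    """Hash partitioning: the shard depends on a hash of the key."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The newest 1,000 auto-incrementing ids, i.e. where current writes land.
recent_writes = range(MAX_KEY - 1000, MAX_KEY)

range_load = Counter(range_shard(k) for k in recent_writes)
hashed_load = Counter(hashed_shard(k) for k in recent_writes)

print("range: ", dict(range_load))   # every write hits the last shard
print("hashed:", dict(hashed_load))  # writes spread over all shards
```

The trade-off is that hashing gives up efficient range scans on that key, which is part of why the choice is easy to get wrong.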

I agree with your points about typical database usage. Could it be that some of the people who reach for a database don't in fact know what a particular database implementation (like Mongo, or MySQL, or Redis) actually does, and they're just looking for a black-box that holds data at rest and occasionally gives it back out?
Yeah, maybe. But there's a bigger point here and that's that software engineering is starting to become more of a "there's a way we do things" craft profession (like medicine, law, or architecture) vs. an "everything is from first principles all the time" endeavor. There's just too much stuff to know. You can't expect people to have deep experience with more than, say, 2-3 databases every decade.

So we have to rely to some extent on the experience of others and our own intuitions / less than perfect inferences.