I’ll allow it. Years later, I’m still miffed at Graylog (centralized logging engine) for having required MongoDB for a small bit of auth and meta storage that could’ve easily been done in MySQL or PostgreSQL (RDS even), forcing that much more ops work to run a small Mongo cluster for HA. Everyone deprecating the use of Mongo is a welcome turn of events.
I shall recall these dark days to the next generation as “NoSQL Madness”, or more colloquially, “my schema is my app layer”.
I once worked at a place that, at some point in the past, had developed some kind of semantic graph database that never gained market traction.
The author of it had ended up as CTO and kept seeking out uses for his work and ended up finding all kinds of odd places for it to live, including as the auth database in a large scale document analytics system and running part of the payroll.
We were constantly running into all kinds of scalability issues where this 12 year old component was often the source of our pain, but he'd never even entertain a conversation to eliminate it and consolidate or replace it with something better.
I worked for a company that was old but had a fancy new product. It was just about to be released when....
A competitor bought us for stupid money. They fired most folks (not my small department) and had their CTO decide between our fancy new product and the product he created. There was no question: our product was amazing, while his Frankenstein was two pieces of equipment cabled together at three times the footprint ... and still did less, and was quite a ways away from being "ready".
If there was a silver lining, the Frankenstein product doomed that company and we got bought by the far more competent competitor they had.
It's surprising how that works. I've been through numerous mergers and acquisitions, including being at the company doing the acquiring.
Lots of hair pulling, and I can say that in every case it was really hard to predict the outcome, positive or negative, for any individual until long afterward.
I think I know the company. We were almost acquired by them until they hit the rocks and we took a hit. Interesting to hear how that database was a pain to manage.
I agree with the general sentiment; I just wanted to point out that "my schema is my app layer" has some valid use-cases. At my current job we deal with highly complex schemas (modelling insurance contracts). Correctness is paramount, but at the same time you need flexibility in terms of change over time. Defining these schemas at the DB level would be painful, a bit like writing web apps in assembly. Languages like Haskell (which we use) help here with their rich and expressive typing capabilities, so we can model these complex domains expressively and use the DB only for persistence. Admittedly it does have downsides, like having to write your own migration layer, but for this use-case the benefits outweigh the pains.
PS.: We do use Postgres though ;) as it has 1st class json support and at the same time we have the luxury of using its relational capabilities where needed (think of a hybrid model).
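For anyone wondering what that hybrid looks like in practice, here is a minimal sketch, not our actual code. It is in Python rather than Haskell, `sqlite3` stands in for Postgres, a plain `TEXT` column stands in for JSONB, and the `Contract` and `Coverage` types with their fields and rules are made up for illustration. The domain rules live in the types; the database gets a few relational columns plus a JSON document.

```python
import json
import sqlite3
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class Coverage:
    kind: str          # hypothetical values such as "fire" or "theft"
    limit_cents: int

    def __post_init__(self):
        # Domain rule enforced in the application, not in the database.
        if self.limit_cents <= 0:
            raise ValueError("coverage limit must be positive")


@dataclass(frozen=True)
class Contract:
    contract_id: str
    holder: str
    start: date
    coverages: tuple

    def __post_init__(self):
        if not self.coverages:
            raise ValueError("a contract needs at least one coverage")


def save(conn, contract):
    # Relational columns for what we query on; the nested part goes in as JSON.
    doc = json.dumps({"coverages": [asdict(c) for c in contract.coverages]})
    conn.execute(
        "INSERT INTO contracts (contract_id, holder, start, doc) VALUES (?, ?, ?, ?)",
        (contract.contract_id, contract.holder, contract.start.isoformat(), doc),
    )


def load(conn, contract_id):
    row = conn.execute(
        "SELECT contract_id, holder, start, doc FROM contracts WHERE contract_id = ?",
        (contract_id,),
    ).fetchone()
    doc = json.loads(row[3])
    # Rebuilding the dataclasses re-runs the domain checks on the stored data.
    coverages = tuple(Coverage(**c) for c in doc["coverages"])
    return Contract(row[0], row[1], date.fromisoformat(row[2]), coverages)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts ("
    "contract_id TEXT PRIMARY KEY, holder TEXT NOT NULL, "
    "start TEXT NOT NULL, doc TEXT NOT NULL)"
)
original = Contract("C-1", "Ada", date(2020, 1, 1), (Coverage("fire", 500_000),))
save(conn, original)
restored = load(conn, "C-1")
```

The part this sketch skips is exactly the downside mentioned above: once the shape of `doc` changes, old rows need a migration step that you have to write yourself.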
My pet theory is that NoSQL took off purely because people were sick of having to manage schema changes.
Unfortunately, people reacted to being (justifiably) frustrated with schemas by throwing strict schemas out entirely, instead of making better schema management/migration tools.
Also, DBAs hate developers. Developers want to make changes to the database to support their classes, such as "I need to add a column", and the DBA response is "no" or, even worse, "fill out this ticket and it will get prioritized in the next scrum". Meanwhile the developer is at a standstill.
I interviewed at Southwest Airlines years ago. I don't remember how it came up, but we were talking about bottlenecks or something and I brought up the fact that having to go to a DBA to get a column added to a table, no matter how trivial, is a great source of delay. The whole room just nodded and looked at the floor; it was obviously painful for them.
NoSQL took the DBA out of the loop, now the developers were in full control of what was persisted and what wasn't. If they needed a new field they just made it so. On the flip side, DBAs got really freaked out and cried to whoever would listen.
In my experience you either have a DBA report to Developers or Developers report to a DBA. Never give them equal footing (even implied) because they'll just fight.
I think you need to also look at it from the DBA's standpoint. If they did whatever the developers wanted and the system went down or, more likely, other parts became slow, it is the DBA who gets the call.
In a large company like SW, the developer requesting some change for their app may have no idea how else the db is being used. What if their requested changes took down the db and prevented reservations from working?
My examples are extreme, but I have seen similar things in my years as both a developer and a DBA at times.
Took me a while to get back here but I do understand your point and it's totally valid. That door swings both ways.
That's why it's hard for the two camps to work side by side.
NoSQL gave power to the devs at the expense of the experience and wisdom of the database folks. I bet many, many applications and systems were completely screwed data-wise more than once because of devs and NoSQL.
Tell that to AWS. They've banned relational databases for specific workloads because Dynamo (nosql) provides more consistent performance, and is easier to operate.
Tons of conflation of Mongo's problems with those of nosql in this thread.
DynamoDB has had major improvements in the last few months: e.g. you get dynamic capacity provisioned tables (avoids read/write capacity exceeded exceptions because of capacity planning uncertainty), and transactions, to name two. However, even if you have a hosted RDBMS it has an implicit read and write capacity throughput that you need to design for (e.g. hotspots in partitions); you just hit it a bit later in your project. The bounded latency at scale (throughput, and size of tables) is the main win for DynamoDB.
None of those OSS relational DBs offer Amazon lock-in the way DynamoDB does - it's more reliable income if someone uses it, but it also takes more convincing for people to use it. What enterprise would use it if Amazon themselves don't?
Do you think all of the companies that chose Cassandra and Dynamo were wrong to do so? There's no use case for NoSQL? There were no lessons learned, value adds from NoSQL?
How do you explain the 'NewSQL' approach, which seems to be so clearly borne of what we've learned from NoSQL?
It should be obvious that NoSQL has value, regardless of the issues with one of the earlier NoSQL DBs.
I don't see a value other than fashion driven development, especially when comparing the bare bones browser GUI for Dynamo with something like SQL Server Management Studio, or that whole story with primary and secondary indexes, with prices being set by index usage.
I think it also has to do with the source of data. If you receive data from a third party it’s easy to insert the whole document and figure out what parts you need later. If your data comes from your own client interface it makes more sense to build up the data model over time.
You could just plonk the data in a JSONB, BLOB or just plain old file on a disk with a URL pointing to it while you figure it out. And not introduce another super complex to support dependency...
Schema management and automated migration generation frameworks alleviate a lot of that headache. As long as the schema definitions are well structured and can be easily analyzed against a live db to find diffs and generate migration scripts. Django does this very well. You don't even need to use Django for the application, you can use it purely to define schemas and perform migrations on the DB. I'm sure there are alternatives for other languages.
People who got tired of dealing with schemas are now realizing that having zero schema is way more of a headache and way more work than the up front work of creating the schema.
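To make the "diff against a live db and generate migration scripts" idea concrete, here is a rough sketch. This is not how Django implements it; it is a toy that uses Python's built-in `sqlite3`, only handles missing tables and added columns, and the `DESIRED` dict with its `users` table is made up for the example.

```python
import sqlite3

# Desired schema declared in code: table -> {column: SQL type}.
# Table and column names are hypothetical.
DESIRED = {
    "users": {"id": "INTEGER", "email": "TEXT", "created_at": "TEXT"},
}


def live_columns(conn, table):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    # An empty result means the table does not exist yet.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def plan_migration(conn, desired):
    """Compare the declared schema with the live one and return SQL to close the gap."""
    statements = []
    for table, columns in desired.items():
        existing = live_columns(conn, table)
        if not existing:
            cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols})")
            continue
        for name, sql_type in columns.items():
            if name not in existing:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
    return statements


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")  # the "live" database

plan = plan_migration(conn, DESIRED)
for statement in plan:
    conn.execute(statement)

# A second pass should find nothing left to do.
remaining = plan_migration(conn, DESIRED)
```

A real tool also has to deal with renames, type changes, dropped columns, and data backfills, which is where most of the actual work is.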
> As long as the schema definitions are well structured and can be easily analyzed against a live db to find diffs and generate migration scripts. Django does this very well.
Setting up graylog was one of the worst mistakes I made. It took forever to get all the required software installed and configured and then it was taking up all the ram on the server doing fuck all.
I think "jvm is a hog" misattribution is actually part of the cause of "jvm being a hog". It implies that you don't need to worry about your memory management and efficiency in Java, because any hoggyness is the jvm's fault. With GC, you can just get away with having memory leaks everywhere and allocate millions of objects per second without any catastrophic consequences. JVM software written with knowledge of memory management, and with the understanding that allocation isn't free, can perform just as well as any other platform. As Bryan Cantrill loves to say in every single talk: "gc is not your problem, allocation is your problem, GC just defers the cost".
My experience with those one-liner installs is that they usually work... strictly speaking. They don't scale, they don't deal with edge cases, they know nothing of your environment. They install one piece of software (in an "interesting" way that won't upgrade), and that's it.
MongoDB by default is (was? It’s been a while since I’ve used it) schemaless, which means all of your data validation must take place in your app instead of the database. Your data integrity is then only as good as your weakest validation.
I've always preferred the terms "schema on write" and "schema on read" to schemaful/schemaless.
At some point, you are always going to have to get the data into some sort of consistent model, so that you can operate on it in a predictable and sane way. So there's no question of there being a schema, even if it's only implicit. The question is, do you apply the schema once, when you write to the data store, so that the data at rest is consistently structured? Or do you allow it to be inconsistent in the storage layer, and instead apply the schema and re-validate the data every time you read from it?
There are valid reasons why one might choose either approach.
Which is not to say that valid reasons always play in to the decision to choose one approach or the other.
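A toy illustration of the two approaches, in case it helps. Everything here is an assumption for the sake of the example: the `validate` rules, the record shape, and the two in-memory store classes are hypothetical and not modelled on any particular database.

```python
def validate(record):
    # Hypothetical schema: a non-empty string "name" and an integer "age".
    if not isinstance(record.get("name"), str) or not record["name"]:
        raise ValueError(f"bad name in {record!r}")
    if not isinstance(record.get("age"), int):
        raise ValueError(f"bad age in {record!r}")
    return record


class SchemaOnWriteStore:
    """Rejects bad records up front, so data at rest is always consistent."""

    def __init__(self):
        self._rows = []

    def write(self, record):
        self._rows.append(validate(record))

    def read_all(self):
        return list(self._rows)


class SchemaOnReadStore:
    """Accepts anything; the schema is applied each time the data is read."""

    def __init__(self):
        self._rows = []

    def write(self, record):
        self._rows.append(record)

    def read_all(self):
        good, bad = [], []
        for record in self._rows:
            try:
                good.append(validate(record))
            except ValueError:
                bad.append(record)
        return good, bad


records = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": "unknown"}]

on_write = SchemaOnWriteStore()
rejected = []
for r in records:
    try:
        on_write.write(r)
    except ValueError:
        rejected.append(r)  # the writer has to deal with the failure right now

on_read = SchemaOnReadStore()
for r in records:
    on_read.write(r)  # nothing is lost at write time
good, bad = on_read.read_all()  # every reader has to deal with the bad rows
```

Same validation code either way; the only thing that moves is who pays for it and when.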
I'm tired and haven't often dealt with database systems. I'm struggling to see significant benefits for schema on read style systems - maybe progressive migration? I'm not convinced...
When you want to do validation depends on when you can do something about it. I work with a NoSQL DB at work and while it wouldn't be my choice for most things I would use a DB for, the lack of validation has some benefits. A good example is where you have no ability to validate input from a user, but where you need to store the data anyway. The last thing you want is your noisy data being kicked out by the DB because it doesn't follow a DB constraint. Sometimes you want to go in afterwards and say, "Show me all the data which is incorrect". This is also useful for dealing with important data sent by other systems which have been coded by people other than you. They get the data wrong (or are using older versions of specs, etc.) but you want to store what they sent you anyway. Then you can go in later and sort it out by hand.
I don't think that kind of thing is particularly common, but there are definite use cases. In our particular case we use it for financial data where we want the data we are given even if it is flawed. I think the OP is 100% correct. You have to write that validation somewhere or else you are in big trouble. Usually it is easier and more convenient to do it at the DB layer, but sometimes you choose to do it somewhere else.
I do a lot of work with both traditional RDBMSes and NoSQL databases.
The main question I would ask is: Is your data schemaless? Often it is - especially when storing what we'd normally call a "document". Heavily polymorphic data is often better stored schemaless. And sometimes you don't necessarily have the schema in advance (common when storing "other people's JSON").
You can store schemaless data in Postgres via the JSONB type, so this isn't necessarily a "Mongo vs Postgres" issue, but more of a general data modeling issue.
As a point of reference, the folks that struggle with schemaless tend to be the ones using Javascript, Ruby, or other type-ambiguous languages. Schemaless is less of a problem in Java and other languages where class structures enforce your schema.
Not having to know all / as many of the structural details up-front could be of value in some use-cases. It can translate to reducing time-to-start cutting code, which can (in some cases) be a business priority, and can lead to identifying critical dependency problems earlier in development.
I'd happily agree that's an inappropriate model in close to 99% of cases, and that even if it was the right model one could (and most likely should) still use a decent database for this anyway.
I can't speak to document stores very well, but one spot where schema-on-read makes sense is in data warehousing type applications. One of the potential troubles with the traditional ETL approach is that transforming the data to fit a fixed schema almost always involves some information loss that might make the data less suitable for answering certain questions.
That's fine if you can predict what questions your business intelligence or data science team will be asked ahead of time, but, realistically, you can't actually do that. Using a schema-on-read data warehouse instead is a more costly option, but also leaves you more able to respond to changing business demands.
One pretty cool use of schema on read is Splunk. It wants to take in all the data and let you search, transform and visualize it in a variety of ways some of which you may not know until you start exploring what data you have.
Right, but most people using it use model or data repository patterns to ensure correctness. It does offer nearly infinite flexibility provided you use it correctly. You can add fields without any sort of DB work, you just start adding fields to rows as needed and let it catch up organically.
There are use cases where mongo makes a lot of sense. It's very popular in the node.js / RAD world for sure. I certainly have never been a huge fan by any means. Only relatively recently did they solve distributed writes.
Unfortunately, I’d argue PostgreSQL gets you all the same benefits with JSON storage (fairly equivalent to Mongo docs), while also giving you all the goodness of a relational, transactional, schema enforcing RDBMS. PGSQL became Mongo faster than Mongo could become PGSQL.
Pretty much all of the things you'd want in a relational database are now present in Mongodb too.
The real benefit of MongoDB at this point is the ability to easily scale beyond a single machine with shards and high availability using replica sets.
Postgres will get you pretty far, but beyond a certain point the scaling story breaks down and you have to hack some sort of user space sharding solution. At that point all the schema updates and backups become a nightmare.
On the contrary, RDBMS provides far more opportunity for validation, because it has all the data at its disposal, which can be queried as needed without the expense of crossing the boundary.
What is the purpose of validation would you say with modern computers? At one time, specifying the exact number of chars was good for squeezing out as much storage as possible, but less so today.
First, most 'noSQL' DB's (including Mongo) have data validations anyhow, rendering the discussion almost moot.
" RDBMS provides far more opportunity for validation"
This can't be true. The application layer, which ultimately contains all 'knowledge' of all aspects of the business, including data from all other resources, can obviously 'provide more opportunity' for validation than any DB possibly can.
Moreover, 'validation' generally implies aspects which are inherently application specific; ergo, doing this purely in the data layer almost implies an intersection of concerns.
Validation in almost every case must be done on the app layer, so anything we get from the DB is an added benefit.
Also, data generally has to be validated when it enters the business logic, long before it gets into the DB. Moreover, there are usually data elements that are not persisted and must be validated anyhow, again illustrating the requirement for validation above the DB.
"Applications generally can't recreate ACID properties" - why would they?
ACID and 'data validation' are generally separate issues.
Data generally has to be validated as it enters the business logic, before it gets stored in a DB. While a DB may in some cases ensure that data adheres to a schema, this usually does not fulfill all of the validation requirements.
Maybe I’m a rare exception, but I chose NoSQL early on when it was still “hot” and have never looked back. We’ve grown from a couple megabytes of data to several dozen terabytes and have had countless issues, but scaling our database was never one of them.
To be fair, you can use schema validators in Mongo. Not sure it's widespread in practice. And there are other distributed databases that aren't document stores that have schemas and various subsets of SQL implemented.