MongoDB by default is (was? It’s been a while since I’ve used it) schemaless, which means all of your data validation must take place in your app instead of the database. Your data integrity is then only as good as your weakest validation.
I've always preferred the terms "schema on write" and "schema on read" to schemaful/schemaless.
At some point, you are always going to have to get the data into some sort of consistent model, so that you can operate on it in a predictable and sane way. So there's no question of there being a schema, even if it's only implicit. The question is, do you apply the schema once, when you write to the data store, so that the data at rest is consistently structured? Or do you allow it to be inconsistent in the storage layer, and instead apply the schema and re-validate the data every time you read from it?
There are valid reasons why one might choose either approach.
Which is not to say that valid reasons always play in to the decision to choose one approach or the other.
I'm tired and haven't often dealt with database systems. I'm struggling to see significant benefits for schema on read style systems - maybe progressive migration? I'm not convinced...
When you want to do validation depends on when you can do something about it. I work with a NO-SQL DB at work and while it wouldn't be my choice for most things I would use a DB for, the lack of validation has some benefits. A good example is where you have no ability to validate input from a user, but where you need to store the data anyway. The last thing you want is your noisy data being kicked out by the DB because it doesn't follow a DB constraint. Sometimes you want to go in afterwards and say, "Show me all the data which is incorrect". This is also useful for dealing with important data sent by other systems which have been coded by people other than you. The get the data wrong (or are using older versions of specs, etc) but you want to store what they sent you anyway. Then you can go in later and sort it out by hand.
I don't think that kind of thing is particularly common, but there are definite use cases. In our particular case we use it for financial data where we want the data we are given even if it is flawed. I think the OP is 100% correct. You have to write that validation somewhere or else you are in big trouble. Usually it is easier and more convenient to do it at the DB layer, but sometimes you choose to do it somewhere else.
It sounds more like an edge case though. I can't imagine all the data you need to store may or may not be the right format, so I wouldn't switch my database just because one or two entities need this.
Anyway this is 2019 so PostgreSQL JSONB fields have got you covered. You can even efficiently query the JSON objects within them.
I'll give a qualified "yes" to that. I agree there is no particular reason you can't use PostgreSQL. There are some advantages to the designs of some No SQL DBs if it fits your use case (immutable data, the ability to replicate easily). For our application eventual consistency was a really good fit. Also we wrote it 10 years ago :-) Even still, we often muse about replacing what we're using with PostgreSQL.
The main reason I wanted to reply to the question was that sometimes I see people who just can't get past not enforcing a schema at the DB layer for your whole data model. It really is crucial to understand that doing so means that bad data doesn't end up in your DB. This isn't always what you want. Like I said, not super common, but not unheard of either.
The underlying technology is pretty unimportant as long as you can do what you need to do. I've historically never really been a No SQL DB fan (there are very few downsides to relational data!!!) However, we've been using CouchDB for the odd thing and IMHO it has its place. Interestingly, I think it was my boss who originally selected it and he's gone very cold in that direction, where I've warmed to it while using it. I think the main thing is to understand exactly what benefit it is giving you (in our case easily replicated data with immutable change sets) and not give in to the hype of "OMG! You don't need a schema!", which is just not true. I've never asked him, but it is possible that my boss thought it would make life easier not to have to deal with schemas and DB migrations, and when it actually made things harder he got upset. I came into it knowing these things, but not really understanding the other benefits, which is why I warmed up to it.
If we were to start again, I think we would almost certainly go the PostgreSQL route, but I can see places where we would have some problems. It's probably a wash, really -- which is why we've not seriously tried to move away from CouchDB.
Unlogged jsonb tables in postgres have generally made nosql systems look pretty bad. I'm really happy the industry finally came up with vitesse so we could have a middle option between "My ACID database needs to scale writes so I'll roll my own fragile sharding layer" and "give up all attempts at schema and consistency and transactions".
Vitess is a really comfortable middle ground of fairly familiar database semantics within a partition.
The default design pattern for storing potentially invalid data with RDBMS is to (usually bulk) load the data in tables without constraints (loading tables), then do the validation in the database, and move valid records to their final tables.
I do a lot of work with both traditional RDBMSes and NoSQL databases.
The main question I would ask is: Is your data schemaless? Often it is - especially when storing what we'd normally call a "document". Heavily polymorphic data is often better stored schemaless. And sometimes you don't necessarily have the schema in advance (common when storing "other people's JSON").
You can store schemaless data in Postgres via the JSONB type, so this isn't necessarily a "Mongo vs Postgres" issue, but more of a general data modeling issue.
As a point of reference, the folks that struggle with schemaless tend to be the ones using Javascript, Ruby, or other type-ambiguous languages. Schemaless is less of a problem in Java and other languages where class structures enforce your schema.
Not having to know all / as many of the structural details up-front could be of value in some use-cases. It can translate to reducing time-to-start cutting code, which can (in some cases) be a business priority, and can lead to identifying critical dependency problems earlier in development.
I'd happily agree that's an inappropriate model in close to 99% of cases, and that even if it was the right model one could (and most likely should) still use a decent database for this anyway.
I can't speak to document stores very well, but one spot where schema-on-read makes sense is in data warehousing type applications. One of the potential troubles with the traditional ETL approach is that transforming the data to fit a fixed schema almost always involves some information loss that might make the data less suitable for answering certain questions.
That's fine if you can predict what questions your business intelligence or data science team will be asked ahead of time, but, realistically, you can't actually do that. Using a schema-on-read data warehouse instead is a more costly option, but also leaves you more able to respond to changing business demands.
One pretty cool use of schema on read is Splunk. It wants to take in all the data and let you search, transform and visualize it in a variety of ways some of which you may not know until you start exploring what data you have.
Right, but most people using it use model or data repository patterns to ensure correctness. It does offer nearly infinite flexibility provided you use it correctly. You can add fields without any sort of DB work, you just start adding fields to rows as needed and let it catch up organically.
There are use cases where mongo makes a lot of sense. It's very popular in the node.js / RAD world for sure. I certainly have never been a huge fan by any means. Only relatively recently did they solve distributed writes.
Unfortunately, I’d argue PostgreSQL gets you all the same benefits with JSON storage (fairly equivalent to Mongo docs), while also giving you all the goodness of a relational, transactional, schema enforcing RDBMS. PGSQL became Mongo faster than Mongo could become PGSQL.
Java did it way too slow, and that is a significant contributor to it being relegated to "legacy" in many areas. If it waited for the other languages to prototype stuff, it might have not been the case. The problem is that it waited for them to prototype it, refine it, release it, popularize it, and for their community to adopt it, before even starting to work on it in Java - which means that by the time they had it, most people who needed it were already elsewhere (not necessarily off JVM, just another language).
Lambdas were a very good example - if you look at the closest competitor, C#, it got the first take on them back in 2005. Then a major refinement in 2008, adding type inference. By 2010, lambdas were idiomatic in C#. Java, in contrast, released the first version in 2014. And even then, they're still less powerful.
Looking around the office and my phone, Java is anything but "legacy".
As for alternative JVM languages, while they are cool and have brought many fresh ideas into the platform, they remain a very tiny portion of the Java developers' market.
Java takes a very long time, because backwards compatibility and cooperation among giant companies takes years.
C# has basically Microsoft deciding how the roadmap looks like and rebooting the platform multiple times.
C# 8 won't even be fully supported on the .NET Framework.
A lot of useful features don't end up being implemented in Java and even if they do, libraries and frameworks have to update to take advantage of them. With a new language libraries are built from the ground up based on the new features.
For example adding async to the language isn't necessarily going to change your programs to be async because every widely used library has already adopted threads and doesn't support async yet and often never will.
Pretty much all of the things you'd want in a relational database are now present in Mongodb too.
The real benefit of MongoDB at this point is the ability to easily scale beyond a single machine with shards and high availability using replica sets.
Postgres will get you pretty far, but beyond a certain point the scaling story breaks down and you have to hack some sort of user space sharding solution. At that point all the schema update and backups become a nightmare.
> Pretty much all of the things you'd want in a relational database are now present in Mongodb too.
On the "relational" front, it has denormalized-only schema validation and limits transactions to replica sets (so no transactions with sharding), while also recommending single-document transactions over multi-document transaction via denormalization. (FWIW, transactions aren't available in any open source release).
On the "database" front, it has a history of misleading users and remorselessly dropping data.
> Postgres will get you pretty far
Postgres is an actual relational database, open source, battle-proven with a good design and a great team behind it. It never claimed to be good at, let alone capable of, doing things it could not actually do (well, or at all).
> but beyond a certain point the scaling story breaks down and you have to hack some sort of user space sharding solution.
Scaling IS a hard thing, and presents itself quite differently to different use-cases. Nevertheless, horizontally scaling Postgres — for when one truly hits the petabyte-scale — is a problem that has been solved correctly many times before (out of core). For a similar-to-MongoDB method, check out Citus, with the assurance that it only adds to an actual database; as opposed to the MongoDB way of doing things: make up and promise magic scaling solutions that "Just Work", then try to build a database on top of it.
On the contrary, RDBMS provides far more opportunity for validation, because it has all the data at its disposal, which can be queried as needed without the expense of crossing the boundary.
What is the purpose of validation would you say with modern computers? At one time, specifying the exact number of chars was good for squeezing out as much storage as possible, but less so today.
Its still absolutely critical for almost everything.
Some use cases:
* My code depends on this value always existing so make this not null
* My code is doing math on this value so make sure it is always a number
* This record belongs to other record, make sure other record can not be deleted while this one still exists
Modern computers change next to nothing with the need to validate data. The worlds fastest computer wont tell you how to add a number that doesn't exist.
First, most 'noSQL' DB's (including Mongo) have data validations anyhow, rendering the discussion almost moot.
" RDBMS provides far more opportunity for validation"
This can't be true. The application layer, which ultimately contains all 'knowledge' of all aspects of the business, including data from all other resources, can obviously 'provide more opportunity' for validation than any DB possibly can.
Moreover, 'validation' generally implies aspects which are inherently application specific ergo, doing this purely in the data layer almost implies an intersection of concerns.
Validation in almost every case must be done on the app layer, so anything we get from the DB is an added benefit.
Also, data generally has to be validated when it enters into the business logic, long before it gets into the DB, moreover, there are usually data elements that are not persisted, and must be validated anyhow, again illustrating the requirement for validation above the DB.
It's extremely common to validate a piece of data not on its own, but how it relates to other pieces of stored data. Without transactional semantics, an application basically can't enforce these invariants w/ any reliability (or those semantics need to be ensured out of band, or w/ little data modeling tricks that tend not to scale well).
There certainly are invariants that are non-trivial or cumbersome to enforce strictly with a schema, but you can really only enforce them w/ a database that provides serializable transactions.
In many cases, schematization of data in the database is good for other reasons though (for instance, guaranteeing type-normalized data in the presence of multiple deployed versions of an app via accident or otherwise, ensuring your queries and updates are typesafe, etc.)
First of all, we were talking about validation of data in the database, specifically.
> 'validation' generally implies aspects which are inherently application specific
Not at all. Taking this at face value implies that some app can write data to the database that is valid according to that app, and then another app can read data that is invalid from its perspective, and have to deal with it. That doesn't make sense - data is data, it's either valid, or it's not. That's why the schema is about the data, not about the app.
> Validation in almost every case must be done on the app layer
For UX reasons, mostly, yes. But it's usually much more basic than what e.g. triggers would do in the DB itself.
I'm not saying that there's nothing to validate outside of the DB, either. But for the data that is in the DB, the DB itself can usually do a better job.
The general idea I've seen (for both SQL and NoSQL databases) is two apps never should write to the same database to ensure separation of concerns. Some API layer instead handles all write operations.
Disclaimer: MongoDB employee. All opinions are my own.
"data is data, it's either valid, or it's not. That's why the schema is about the data, not about the app."
This is not true.
The objective of the overall app/system (i.e. front/back/middle/DB/storage/services etc.) is to carry out some kind of business logic. A DB schema cannot fully validate stored data against the logic.
Otherwise we wouldn't write backend code, we'd just write a bunch of schemas and be done with it.
Let's use a crude example: a password. (Of course, we would never in reality store a password as a string in the clear, but just as an example ...). When a user sets a new password, we have to validate that it meets specific requirements in terms of format, and then some others rules which are more complicated such as: "can't be the same password as the last 5".
Those 'password rules', for example, cannot be encapsulated in the schema of the DB and yet must be applied in order for the data to be 'valid' from the perspective of the app, or 'overall system'.
The DB may only care that it's UTF and max 20 chars. But the system requires more validation than that.
Re: Your statement about 'one app writing data, and the other app not knowing what to do with it'. This is not true, because all apps operating on such data must understand it data in the context of business/logic context in which it was designed. Even 3rd party users of such data, via API's, must understand this data from the level of business logic - not merely 'schema validation'.
When you query data from Google Geolocation, the 'city' field may be a valid string of a certain length, but that's not very useful: it must actually be the name of a city! Any 'app' using this data must operate with the explicit understanding that this is in fact the name of a city - and not just a string that met a DB schema validation requirement.
> The application layer, which ultimately contains all 'knowledge' of all aspects of the business,
Data always outlives the application. You could argue that some app + data lives on together, but then you have just poorly reimplemented what an RDBMS does for you up front.
"Applications generally can't recreate ACID properties" - why would they?
ACID and 'data validation' are generally separate issues.
Data generally has to be validated as it enters the business logic, before it gets stored in a DB. While a DB may in some cases ensure that data adheres to a schema, this usually does not fulfill all of the validation requirements.
Validation often requires examine a model beyond "is this an int?". That model needs to be self-consistent. That requires atomic movements from consistent state to consistent state.
You can do that yourself. Or let the database do it. For things where you can't express it in a database schema, sure. But you'd be surprised how far it gets you.
Edit: scheme/schema autocorrect typos corrected. Thanks!