Hacker News new | ask | show | jobs
MongoDB: How to Use the JSON Schema Validator (percona.com)
31 points by sedlav 2869 days ago
1 comments

I've never really gotten the appeal of "schemaless" databases. Your data has a schema, whether you're explicit about it or not, but being explicit allows you (and forces you) to reason about it at the right time at the right level of abstraction.
Schemaless is great for quickly bashing a hobby project (or PoC) together. Not so much for larger scale for the reasons you gave.
A big problem is MongoDB isn't schemaless, it is document oriented.

That is a really big schema choice imposed on developers and the root cause of for instance why it took years to be able to create good indexes on inner fields, and they still can't do it efficiently.

The same can be said for simple key value stores.

True schemaless databases are much rarer and provide developers orders of magnitude increases in terms of flexibility, expression and speed.

Unfortunately these kinds of databases are not readily available to a broad audience. It's also the case that for the majority of jobs and developers SQL based stores are still a very good choice unless you need to deal with huge amounts of data from disparate sources without bringing down the system.

Schemaless databases allows you to model your data in a way that is more natural to your domain. The table-relational model used by SQL databases isn't always suitable for all types of data, most notably trees and graphs.

Don't get me wrong, I'd probably pick a traditional RDBMS over MongoDB for most projects, but the object-relational impedance mismatch is a real problem.

True, but all those migration scripts/patches/table locks add complexity and inconvenience (less agile).
> True, but all those migration scripts/patches/table locks add complexity and inconvenience (less agile).

I don't think so. You could say exactly the same thing about maintaining tests for your code. Explicit schemas and constraints (and the effort that goes along with maintaining them) are very much like having tests for your data. They both help ensure that your code actually works when it needs to.

Last time I worked with PostgreSQL, it would lock entire table while adding a new column. Imagine users table, then you have to schedule this kind of altering outside business hours... which impacts devs (they have to stay late and/or come early) and business potentially, it has to wait till next day to have a feature delivered.

All these are trade-offs and everyone decides what is more important for them.

> it would lock entire table while adding a new column

It does so for the entire << 1ms it takes to add the necessary metadata.

If that's the case, then one less problem to deal with, however I remember it was a matter of several minutes.
you should interpret "schemaless" as "free/non-fixed schema".

> Your data has a schema, whether you're explicit about it or not,

Partially true, but not accurate. Often the items in a data set have similar schemas, but not exact the same one.

good luck writing code for that kind of dataset then.

Data has schema by definition otherwise you wouldn't be able to reason about it.

Certainly, but that doesn't mean the schema has to be a strict validation encoded into your storage format. It's a perfectly well-defined programming model to say "well, I'm reading query X with schema Y, and if some rows don't match Y give me nulls instead".
"well, I'm reading query X with schema Y, and if some rows don't match Y give me nulls instead"

Seems like a recipe for disaster to me, but well... A database isn't a "storage format". It's most often the single source of truth for a set of information.

Not being fully sure what data you expect from that source of truth and yet being able to query it is really dangerous. What if you start to update this data after having nullified things you didn't understand ?

Schemaless databases are good for scenarios where the database isn't a source of truth. If you have a table full of e.g. per-second heartbeats from a bunch of deployed services, there's no fundamental underlying truth anyone's trying to gather from it, and you can't afford to run a full schema migration every time someone adds a new metric.

I recognize some people do try to use schemaless databases in the way you're describing, and I agree that's weird and dangerous.