Hacker News new | ask | show | jobs
by metafunctor 2142 days ago
In projects I've been involved with, storing JSON in the database has often turned out to be a mistake.

Django models create a well-defined self-documenting structure for your schema, are easy to evolve using migrations, and there's a wealth of tooling built on top. IMHO, these far outweigh the perceived convenience of simply storing some stuff in a JSON field.

If you find yourself implementing your own validation and schema code for JSON fields, I'd say it's a sign that you should probably stop and migrate the data to Django models instead.

There are some cases when storing JSON is fine, of course, but in my experience they are few and far between.

5 comments

> If you find yourself implementing your own validation and schema code for JSON fields, I'd say it's a sign that you should probably stop and migrate the data to Django models instead.

Don't agree. Where JSON fields with validators are useful is when you would otherwise have inherited models for different variants of the same type of thing. It's a lot easier to understand and query one model with 50 Cerberus (or whatever) schema validators than it is to understand and query 50 Django models. With the JSON approach it leaves one part of the codebase that's complicated to understand, but you mostly shouldn't need to understand it. Whereas when you have tons of different models and each one needs to be used in different places and is queried in different ways and has its own serializers, the mess tends to spread throughout the entire codebase.

I am skeptical (with an open mind) of the JSONField pattern. We’ve used it in a few places where you have something like inheritance, so you want to handle all Event objects the same at the top level, and then dispatch differently based on Type. This works fine.

My concern is that this is a pattern which experts advocate, but less precise/experienced engineers are likely to break when it comes time for a data migration.

With normal SQL fields, if you change the schema you need to create a migration, and that gives an easy point to check for datamigrations too. If any commit can silently change the write-schema, it’s a lot harder to police.

So I suspect if your team skews towards very experienced engineers the JSONField pattern probably gains in value. If that’s right it’s probably best to caveat the recommendation.

Interested to know your experiences with schema changes though.

I’m also interested to note that you are pointing out a general issue with Django I’ve experienced - the Models spread through the whole system. You can solve that in other ways, say by having a repository layer that used normal (non-JSONField) models and maps them through to a POPO. I’m not sure you _need_ the JSONField pattern to get the separation you’re looking for.

> If any commit can silently change the write-schema, it’s a lot harder to police.

In addition to schema validators, you really also need a data integrity async task that runs once a day to catch these kinds of issues. They're not that hard to fix if you catch them early, it's if you realize there's a data integrity issue years later and the original team isn't even there anymore that you get bigger problems. If you have a lot of data, a reasonable solution is just having it run on everything from the last 24 hours plus a random sampling of data older than that.

> Interested to know your experiences with schema changes though.

Haven't done a ton, so admittedly there could be complex cases I'm not seeing. I think it helps though, in cases like the example I gave of using it for form-like things, to set it up like:

  QuizModel
  QuizQuestionModel (FK to QuizModel)

  QuizResponseModel (FKs to QuizModel, UserModel)
  QuizQuestionAnswerModel (FKs to QuizResponseModel, QuizQuestionModel)
This way you have four models rather than an indefinite number. But because it's not just one big model with lists of JSON objects for questions and responses, this makes running migrations and data integrity checks much simpler to write and easier to reason about.

The idea being also that you have JSON schema validators for the different types of JSON blobs that get stored in QuizQuestionModel and QuizQuestionAnswerModel.

Like I said, there are some cases where using JSON fields makes sense. Good examples would be cases where you wish to retain the structure of the data, but don't have control over it. Sounds like you're describing a case like that.
I’ve used it more for generating forms, e.g. for when there are different types of questions or modules can ask or show the user —- multiple choice, free response, upload an image, watch a video, etc. So I’ve used JSON here both for storing the questions and responses, and then have used validators for each question type and response type. The downside is that it’s hard to do migrations. But in most cases you can just save the rendered HTML each user is shown and that covers you for having an audit trail, rather than maintaining backwards compatibility indefinitely. And if you do need to maintain compatibility forever, the schema validators enable that.

Having done or seen almost the exact same use for JSON at multiple startups, for whatever downsides there are I think it’s 10x better than needing to try to understand a bunch of models, each with their own queries and logic.

100% agreed, JSON fields are good when you want to dump a bunch of data you don't care about the shape of, but if you need a schema, pull the data into actual fields.
For a database like postgres, or even for something like SQLite, this mostly becomes a distinction without a whole lot of difference, since the database can index and access JSON structures just like regular columns.
JSONB columns are incredibly useful if you have data that doesn't fit well into a strict relational schema. And they are very powerful in terms of query ability in Postgres, but still quite far from a conventional relational schema.

Queries that act on the insides of a JSONB column are much harder to write than the equivalent conventional queries. They are also much, much slower in certain cases as the DB has to read the entire JSONB blob in many cases, and because they don't have proper statistics. The performance can range from slightly slower to completely pathological query plans that take ages.

I'm a big fan of JSONB in Postgres, but it is no replacement for a relational schema, there are far too many problems you inflict on yourself if you try to use it like that.

Access, but not constrain or type-check.

And it's just harder to use, I'm with it being generally a 'mistake' (in the 'you will regret this later' sense) - it's better than text, but I'm in the camp of as much structure as possible ('you will thank yourself later').

> Access, but not constrain or type-check.

You can create CHECK constraints involving JSON operations in postgres, which is the same way you'd enforce any constraint.

Manually, right? Django autogenerates all of the SQL schema for you on non-JSON fields. You could replicate that work, but that sounds tedious.
As others have mentioned, it definitely does have downsides when it comes to writing queries and especially when using frameworks/libraries to interface with the db.

But for the right job, JSONB columns are awesome!

What’s a real world example where you would choose to index the JSON as you said rather than putting them into the fields/schema.
Any time you'd otherwise use EAV.
EAV is considered an antipattern. While wouldn't rule out using it, it probably means it should be avoided if there is another solution.
I've been working on a project using schema.org objects (100's of types of objects) in postgres in a graph/tree like structure. While structured the objects are not consistent enough for me to want separate schemas for them. Using JSONb in postgres allows me to index separate fields within the JSON while still having all the objects in a single graph.

If I couldn't use JSONb with indexes (and triggers for custom foreign key checks) it'd be much harder to do this in in a relational database (and the data is relational, it just isn't consistent across all types).

Perhaps this would be better in a Triplestore but there are a lot of features of postgres that are not available in them.

I think there was a bit of a confusion on the implied use case of our project. As another commenter pointed out, when the data is user supplied and is not always uniform in nature, creating migrations for each adjustment to the custom fields they can modify becomes virtually impossible. I ended up settling for JSON fields because I don't have to migrate the data or the database schema; I can just do a best-guess conversion for the data when it hits the UI, and supply some validation warnings if it really doesn't make sense (e.g. a non-date string somehow makes it into a "date" field).
>>In projects I've been involved with, storing JSON in the database has often turned out to be a mistake.

I mostly agree. The only JSON I store in the database is UI settings, because it's easy enough to parse it with JavaScript (and fall back in default if the JSON doesn't contain that prop). It also decouples your database schema from the UI, the latter of which can evolve much more rapidly.

For everything else, having a schema is better.

Can you please elaborate on the "wealth of tooling" part? Makes me wonder what I don't know...
This is the best comment I've read on HN in a year, thank you. Some great stuff in here I didn't know existed, will save me a ton of time and frustration in the future.
Also the ModelAdmin won’t be friendly for JSON data.