Hacker News new | ask | show | jobs
by grenoire 2143 days ago
JSONField has been an absolute godsend in combination with Django's ORM. I had been using it with Postgres and will likely keep our backend the same, but I cannot recommend it enough. You will have to write some validation and schema code on top if you want your data to have similar (but weaker) guarantees to the usual typed fields; the benefits from the flexibility you get are immeasurable though.
7 comments

In projects I've been involved with, storing JSON in the database has often turned out to be a mistake.

Django models create a well-defined self-documenting structure for your schema, are easy to evolve using migrations, and there's a wealth of tooling built on top. IMHO, these far outweigh the perceived convenience of simply storing some stuff in a JSON field.

If you find yourself implementing your own validation and schema code for JSON fields, I'd say it's a sign that you should probably stop and migrate the data to Django models instead.

There are some cases when storing JSON is fine, of course, but in my experience they are few and far between.

> If you find yourself implementing your own validation and schema code for JSON fields, I'd say it's a sign that you should probably stop and migrate the data to Django models instead.

Don't agree. Where JSON fields with validators are useful is when you would otherwise have inherited models for different variants of the same type of thing. It's a lot easier to understand and query one model with 50 Cerberus (or whatever) schema validators than it is to understand and query 50 Django models. With the JSON approach it leaves one part of the codebase that's complicated to understand, but you mostly shouldn't need to understand it. Whereas when you have tons of different models and each one needs to be used in different places and is queried in different ways and has its own serializers, the mess tends to spread throughout the entire codebase.

I am skeptical (with an open mind) of the JSONField pattern. We’ve used it in a few places where you have something like inheritance, so you want to handle all Event objects the same at the top level, and then dispatch differently based on Type. This works fine.

My concern is that this is a pattern which experts advocate, but less precise/experienced engineers are likely to break when it comes time for a data migration.

With normal SQL fields, if you change the schema you need to create a migration, and that gives an easy point to check for datamigrations too. If any commit can silently change the write-schema, it’s a lot harder to police.

So I suspect if your team skews towards very experienced engineers the JSONField pattern probably gains in value. If that’s right it’s probably best to caveat the recommendation.

Interested to know your experiences with schema changes though.

I’m also interested to note that you are pointing out a general issue with Django I’ve experienced - the Models spread through the whole system. You can solve that in other ways, say by having a repository layer that used normal (non-JSONField) models and maps them through to a POPO. I’m not sure you _need_ the JSONField pattern to get the separation you’re looking for.

> If any commit can silently change the write-schema, it’s a lot harder to police.

In addition to schema validators, you really also need a data integrity async task that runs once a day to catch these kinds of issues. They're not that hard to fix if you catch them early, it's if you realize there's a data integrity issue years later and the original team isn't even there anymore that you get bigger problems. If you have a lot of data, a reasonable solution is just having it run on everything from the last 24 hours plus a random sampling of data older than that.

> Interested to know your experiences with schema changes though.

Haven't done a ton, so admittedly there could be complex cases I'm not seeing. I think it helps though, in cases like the example I gave of using it for form-like things, to set it up like:

  QuizModel
  QuizQuestionModel (FK to QuizModel)

  QuizResponseModel (FKs to QuizModel, UserModel)
  QuizQuestionAnswerModel (FKs to QuizResponseModel, QuizQuestionModel)
This way you have four models rather than an indefinite number. But because it's not just one big model with lists of JSON objects for questions and responses, this makes running migrations and data integrity checks much simpler to write and easier to reason about.

The idea being also that you have JSON schema validators for the different types of JSON blobs that get stored in QuizQuestionModel and QuizQuestionAnswerModel.

Like I said, there are some cases where using JSON fields makes sense. Good examples would be cases where you wish to retain the structure of the data, but don't have control over it. Sounds like you're describing a case like that.
I’ve used it more for generating forms, e.g. for when there are different types of questions or modules can ask or show the user —- multiple choice, free response, upload an image, watch a video, etc. So I’ve used JSON here both for storing the questions and responses, and then have used validators for each question type and response type. The downside is that it’s hard to do migrations. But in most cases you can just save the rendered HTML each user is shown and that covers you for having an audit trail, rather than maintaining backwards compatibility indefinitely. And if you do need to maintain compatibility forever, the schema validators enable that.

Having done or seen almost the exact same use for JSON at multiple startups, for whatever downsides there are I think it’s 10x better than needing to try to understand a bunch of models, each with their own queries and logic.

100% agreed, JSON fields are good when you want to dump a bunch of data you don't care about the shape of, but if you need a schema, pull the data into actual fields.
For a database like postgres, or even for something like SQLite, this mostly becomes a distinction without a whole lot of difference, since the database can index and access JSON structures just like regular columns.
JSONB columns are incredibly useful if you have data that doesn't fit well into a strict relational schema. And they are very powerful in terms of query ability in Postgres, but still quite far from a conventional relational schema.

Queries that act on the insides of a JSONB column are much harder to write than the equivalent conventional queries. They are also much, much slower in certain cases as the DB has to read the entire JSONB blob in many cases, and because they don't have proper statistics. The performance can range from slightly slower to completely pathological query plans that take ages.

I'm a big fan of JSONB in Postgres, but it is no replacement for a relational schema, there are far too many problems you inflict on yourself if you try to use it like that.

Access, but not constrain or type-check.

And it's just harder to use, I'm with it being generally a 'mistake' (in the 'you will regret this later' sense) - it's better than text, but I'm in the camp of as much structure as possible ('you will thank yourself later').

> Access, but not constrain or type-check.

You can create CHECK constraints involving JSON operations in postgres, which is the same way you'd enforce any constraint.

Manually, right? Django autogenerates all of the SQL schema for you on non-JSON fields. You could replicate that work, but that sounds tedious.
As others have mentioned, it definitely does have downsides when it comes to writing queries and especially when using frameworks/libraries to interface with the db.

But for the right job, JSONB columns are awesome!

What’s a real world example where you would choose to index the JSON as you said rather than putting them into the fields/schema.
Any time you'd otherwise use EAV.
EAV is considered an antipattern. While wouldn't rule out using it, it probably means it should be avoided if there is another solution.
I think there was a bit of a confusion on the implied use case of our project. As another commenter pointed out, when the data is user supplied and is not always uniform in nature, creating migrations for each adjustment to the custom fields they can modify becomes virtually impossible. I ended up settling for JSON fields because I don't have to migrate the data or the database schema; I can just do a best-guess conversion for the data when it hits the UI, and supply some validation warnings if it really doesn't make sense (e.g. a non-date string somehow makes it into a "date" field).
>>In projects I've been involved with, storing JSON in the database has often turned out to be a mistake.

I mostly agree. The only JSON I store in the database is UI settings, because it's easy enough to parse it with JavaScript (and fall back in default if the JSON doesn't contain that prop). It also decouples your database schema from the UI, the latter of which can evolve much more rapidly.

For everything else, having a schema is better.

Can you please elaborate on the "wealth of tooling" part? Makes me wonder what I don't know...
This is the best comment I've read on HN in a year, thank you. Some great stuff in here I didn't know existed, will save me a ton of time and frustration in the future.
Also the ModelAdmin won’t be friendly for JSON data.
We've been using it pretty heavily and the ability to do deep queries like Person.objects.filter(data__family__spouse__name__istartswith="oli") is great for one-offs but the performance stinks.

As it well should —this isn't a complaint— just a reminder that keeping denormalised, first party data that can be well indexed is invaluable when you're talking about huge datasets.

It'd also be nice to layer in JSON Schema directly into the modelling but there is a draft-7 compliant Python project that you can use for this in the model's clean() method.

Postgres allows you to create an index on keys within a JSONB column[0]. It may be the case that Django doesn't offer support for this but I would expect you can probably get around that with raw SQL.

[0]: https://www.postgresql.org/docs/current/datatype-json.html#J...

In addition to raw SQL, postgres JSON fields can be indexed with GinIndex https://docs.djangoproject.com/en/3.0/ref/contrib/postgres/i...
Yeah I've not had too much luck with GINIndex. Firstly I can't rule out user error but they just didn't work well. Maybe it was the volume (and the index generation wasn't keeping up with the ingress of data —"only" around 2w/s— or it's the 50M rows...

This was WORM data so denormalising on write didn't need update triggers, and was the difference between 100ms and 10s queries. That was still too slow for us so we started caching expected data outside postgres.

> data__family__spouse__name__istartswith="oli"

that's some syntax right there :)

Couldn't tell if you meant to say it was good or bad, but actually I found it to be crystal clear which is impressive for a complicated dataset.
neither, I guessed it was better than what was prior even though it stings the eye
How long have you been using this?

My intuition says this is one of those tradeoffs that is fast and easy in the short run, but over time becomes technical debt. I haven't used these over the long term, though.

Yep - SQLAlchemy's JSONB type is awesome as well. It's made life so much easier for relatively unstructured data!
Can you give some examples where a relational schema isn’t suitable?
Storing historical webhook notifications
The bonus with this is that at the start you are probably only interested in a few of the fields in the webhook, but as your application develops, you may need to extract more. If you've stored the JSON of previous webhooks, you have an easy migration path.
Sparse data.

For example, if there are 1000 possible attributes, only 5% are populated for any given row. If you have 1000 columns you are going to have a ridiculously wide and mostly empty table.

In Postgres, NULL values take one bit of storage as far as I understand. So a table has to be very, very wide before this becomes a benefit. Of course if you have attributes that you don't control but are e.g. user-specified, JSONB column are a good choice.
The traditional approach to efficiently modelling data like that with relational databases is to use the EAV[1] model, which works quite well in practice.

[1]: https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...

I use the JSONP column type in Postgres for data that is structured but user-supplied.
The user is supplying json?
There is a key-value interface so not directly.
Why not store this data into a relational database?
Because values can be different types (string, int, boolean).
Product data on a webshop. If you have a wide selection of product categories you’ll have a huge number of attributes that aren’t relevant across your categories.

There’s a number of ways to deal with the issue, e.g. a model per category or an EAV database pattern (this is what platforms like Magento does). None are really ideal, but storing the product attributes as JSON works pretty well. Even more so when your database supports querying the JSON blob.

Storing arbitrarily complex Boolean queries (think Elasticsearch's query syntqx). While this could be done in SQL tables, there are no gains with referential intregity and while there would be some in consistency, I don't think they're worth it.
Speaking strictly of schema...

Relational schema are flexible and allow for an optimizer to figure out how best to fetch your data based on the data are querying and filtering for and stored statistics about your data set.

Document oriented storage is great when you don't need the optimizer because you already know how the data is written and read. This means you can bundle it up into small, single fetch documents. No statistics or optimizer necessary. This is great if you understand your use cases really well, and they never change (good luck with that) or you have a large distributed data set that would be tough on an analyzer.

There are scenarios when you can't avoid EAV and then JSON is great even when it's for denormalized data.
Data that tends to change after the application is deployed.

More importantly, JSONFields can be used as a substitute for EAV pattern.

Looks like similar to Pydantic (included by default in Fastapi)
Typical use case: There are a number of required fields that your app won't work without. Those should be columns, probably non-nullable columns.

Then there're a bunch of optional fields. Traditionally, a number of nullable cols would be created. But they're ultimately messy (need to null check every accesses) and unneeded since you can replace them with a single json item. Keys are always optional, so you always need to check for presence or absence, and modern dbs support json natively in many clauses.

Fields are messy because you have to check for nulls, yet JSON keys aren't messy because you have to check for nulls?
All json keys are optional, so you ALWAYS need to check for their presence or absence. Mess exists when some key is nullable and some isn't. Checking for nullability on a key that can never be null isn't extremely elegant.
Why is that a mess? I don't see the difference.