Hacker News new | ask | show | jobs
by simonw 1155 days ago
I think this problem can be robustly solved if you have the right mechanisms in place:

1. Migrations. Your schema needs to live in version control, and changes to your schema must be applied by an automated system. Django migrations are the gold standard here in my opinion, but you can stitch together a custom system if you need to, one that tracks which migrations have been run already and provides a mechanism to apply new ones.

2. Automated tests. Your triggers MUST be covered by automated tests - call them unit tests or integration tests depending on your preferences, but they need to be exercised. Don't fall into the trap of mocking your database in your tests!

With these two measures in place I'm 100% comfortable using triggers that split my business logic between my application code and my database code.

3 comments

I don't think migrations (at least as done by Django et al.) solve it - you want a declarative source of truth for what the schema looks like today, not a chain of changes that only tell you that after computing the combined effect.

Even if they just created a generated file of the final schema, that sat in version control and errored the makemigrations check (just like a missing migration) if it was out of sync, that would be a significant improvement IMO. But I think the Django maintainers at least would say they want the ORM DSL to be that. (But it's way too incomplete, you'd be limited to a tiny subset of postgres, and even then you have to be on board with that being a reasonable description of your schema, not wanting the actual db schema anywhere.)

If you really, really need to be able to see a SQL schema representing the current state, a cheap trick is to run an automation on every deploy that snapshots the schema and writes it to a GitHub repository.

I do a version of that for my own (Django-powered) blog here: https://github.com/simonw/simonwillisonblog-backup/blob/main...

This comes up a lot, and I'm always surprised that more people don't know about Flyway and "repeatable migrations": https://flywaydb.org/documentation/tutorials/repeatable

> Repeatable migrations are very useful for managing database objects whose definition can then simply be maintained in a single file in version control. Instead of being run just once, they are (re-)applied every time their checksum changes.

One of my pet projects addresses these points you raise about Django + migrations, have a look:

https://pypi.org/project/DBSamizdat

In a nutshell it allows you to keep DB functions, triggers and views around as Python classes, so you can version them together with the rest of your application code. The DB state gets updated for you (in the right dependency order) whenever you change them.

It can also run without Django.

This is exactly what I would use dbt for
Had a thought recently. What's stopping someone from separating table migrations from others like functions/triggers?

The standard tables can use the standard process while triggers, etc are run declaratively. When you change something, a tool could simply tear everything down and rebuild it. No need to worry about data since those are handled separately.

Would performance be a concern? Or is there something I'm missing?

> you want a declarative source of truth for what the schema looks like today, not a chain of changes that only tell you that after computing the combined effect

Applying a series of migrations to get a final db schema is not much different than a version control system like git.

Ok, so how do I checkout a version and view the file/schema tree?

If you like, you can view my comment above as saying 'if they included that tooling, not just the similar tree of changes stored, it would be better'.

How to get from one state to the next is interesting to the computer, not to me (not after I've initially written it/done it in prod anyway), I'm interested in the result, where do we stand after all of the migrations (currently on disk having checked out whatever).

You checkout that version, then run the migrations from scratch against a fresh database.

In Django that's "./mange.py migrate".

And then inspect the db, exactly. In other words you can't, there is no declarative 'current state' checked in or provided by the migration tooling.
If you are going to use some kind of meta-schema tool to manage migrations, you should store your declarative schema in whatever syntax this tool understands and allow it to generate the schema-mutating migration commands as runtime artifacts (more like a build output than a versioned source).

If not using such a tool, you might adopt a little meta-schema logic in your own DDL. With PostgreSQL, you can try writing idempotent and self-migrating schemas. Using syntax variants like CREATE OR REPLACE for functions and views, DROP ... IF EXISTS, CREATE ... IF NOT EXISTS, and ALTER ... ADD IF NOT EXISTS allows some of your DDL to simply work on an empty DB or a prior version. Wrapping all this within a pl/pgsql block allows conditional statements to run DDL variants. Conditionals are also useful if you need to include some custom logic for data migration alongside your schema migration, i.e. create a new structure, copy+transform data, drop old structure.

For smaller DBs, you may be able to afford some brute-force techniques to simplify the DDL. Things like DROP IF EXISTS for multiple earlier versions of views, indexes, or constraints to clear the slate and then recreating them with the latest definitions. This may add IO costs to execute it, but makes the DDL easier to read as the same DDL statements are used for the clean slate and the migration. Similarly, a little pl/pgsql logic could loop over tables and apply triggers, policies, etc. that you want to use systematically in a project.

If possible, you can also prune your code so that any specific version in your source control only has logic to handle a few cases, i.e. clean slate builds and a migration of N-1 or N-2 to latest. This minimizes the amount of branched logic you might have to understand when maintaining the DDL. The approach here depends a lot on whether you are versioning a product where you control the deployment lifecycle or versioning a product release which may encounter many different prior versions "in the wild".

In any case, you have a more complicated test requirement if you want to validate that your system works from multiple starting points. I.e. clean-slate builds as well as migrations from specific earlier versions. I think this is true whether you are using some higher level migration management tooling or just rolling your own conditional and idempotent DDL.

I don't understand. What are you looking for here?

If you want a plain text SQL file to look at you can have that with a bit of extra automation - for example, every time you tag a release you could have a script that runs "./manage.py migrate" against that checkout and then dumps the schema out to a file somewhere (an asset attached to the release on GitHub as one example).

It’s completely different as git does not store deltas but snapshots of whole files and file trees.
> Django migrations are the gold standard here in my opinion

Intriguing. In JavaWorld it's really only Flyway and Liquibase, and the important self-imposed rule, "Thou shalt never remove a column from a table."

I've always resisted the Django migration approach personally. One of the things I disliked most about it.
But what problem is moving to triggers solving? Is keeping all your logic in a single place actually a problem, or is moving part of it to Postres tempting because it's nifty? How much training will you have to provide for new hires to be able to run with Postgres triggers instead of just Django logic?
Triggers protect you against bugs.

If your application logic needs to remember to update a denormalized column somewhere it's very easy for a bug to slip in in the future when someone adds a new piece of Python code but forgets to update the denormalized column.

With triggers you can eliminate that source of bugs entirely.

This is even more valuable if you have more than one system interacting with a single database.

Triggers can also be significantly more performant depending on what you are doing.

But yeah having the business logic split like this is a hard sell. There needs to be an extremely good reason to do it. It adds developer overhead because not only do you have to maintain another test harness and deal with an entirely separate system encapsulating yours, every time you touch business logic you’re now wasting some brainpower deciding where it should live.

I did this with robust DB constraints and selective use of triggers and stored functions. It can feel like a strait jacket long term. Though it did help protect against less competent DBAs breaking things and avoid some footguns.