Hacker News new | ask | show | jobs
by PaulMest 2603 days ago
I am guessing that you are probably not using Python/Django... but is this any different than what Django offers?

Django allows you to define your models (schema) and then run a command that will generate the migrations. If you don't like the migration that was generated, you can modify it. You can customize up and down operations.

There are also tools that will take an existing database and generate the Django models for you.

All of these operations can also export the exact SQL that will run on each supported RDBMS platform in case you want to be extra sure on what exactly will be executed.

3 comments

Django migrations can be problematic because they're meant to be sequential and have interdependencies. I've had problems merging multiple feature branches because of this, even though there are no code conflicts.

A system like Saltstack or Puppet for databases would not have checked in migrations, these would be generated on the fly at deploy time.

So you could very well have multiple state changes in a single run, by comparing actual DB state and desired DB state, then creating the SQL code as needed for that deployment.

Honestly not having to fiddle with the migrations table on a live server seems pretty nice ;-)

This could very well turn out to be Django's next gen migration tool...

> Django migrations can be problematic because they're meant to be sequential and have interdependencies. I've had problems merging multiple feature branches because of this, even though there are no code conflicts.

They're actually a directed graph; this means a conflict wasn't handled on the branches that should have been, and would probably have been a problem regardless.

This was helpful to think about, thanks.

I've rarely encountered logical merge conflicts with migrations, but I could see it happening.

I used to be on the SQL Server team at Microsoft and had some exposure to the customer support teams. So data integrity and eliminating any potential for errors was huge.

So while I love the idea of migrations being generated on the fly from actual state in Production-System-5 to desired state of commit 27a73e, I'm skeptical of it working that well in practice. Certain cases come to mind where there might be intermediate migrations from [full name] -> ([first name] [last name]) -> ([first initial] [last name]). The system would have to be smart enough to know A -> C may require A -> B -> C or prompt the engineering team for a better DML migration script.

Also, you will want there to be documentation about what was performed whether that is a migrations table that points to a .py file... or a .json output... or a log file.

Yeah. I’d love to see the academic paper with formalizations that help me understand the true scope of this problem. Your example is a great one that prompts many questions. Is it possible to travel directly to the commit o(1) or will the code have to calculate the diff of each commit and apply them one at a time o(n) and how much definition and dependency mapping humans need to do to have it work correctly?
The closest I can think of is trying to define a set of CRDT-compatible operations that are expressive enough to describe your database schema, starting from an empty database. Then, the migration you need to perform is whatever the CRDT merge operator says you need to do.
That's great initially, but problems definitely crop up at scale:

* What happens when your company creates new systems that aren't Python/Django? You can either still shoehorn all migrations into Django models, or have multiple separate schema change processes/pipelines... both options are not good.

* If someone makes an out-of-band schema change manually (either by accident or to do a rapid hotfix), you're no longer on a known version. Not sure about Django, but when this happens, most traditional migration systems cannot operate without additional manual cleanups... whereas declarative tools can inherently transition any state to any other state.

* Depending on the DBMS, with large tables and certain types of ALTERs, using an external online schema change tool is greatly preferable to just running an ALTER directly.

* Does Django support sharding / the notion that a change to a sharded model must be made to multiple shards?

* I see your point on not standardizing on one framework. Generally when that has happened for me, it turns into a new service and it has its own database/tables/migration management. It does get quite annoying, for sure.

* I've seen enough things go wrong that on my teams I do not allow DDL to be executed outside of a controlled process that comes from code. But yeah, if that were to happen, it would annoying to figure out what was done and then try to re-model.

* With Django you can specify exact SQL to run. So you can break up operations into multiple smaller steps... canonical example is building a new column based on an old column. You first add the column with NULL. Then you populate in batches of ~10k records. Then you add on the constraints/indexes.

* I haven't used Django with sharding. It appears there are some posts about it, but it all appears to be community generated content and not part of the official docs.

All-in-all, I could see that at a large scale with very mature engineering organizations with lots of activity and complex operations that something like Django could fall short and a home-grown system like this may be beneficial, assuming it were reliable enough.

Sharding is usually handled by patching the QuerySet / ObjectManager, how Citus handles it is a good example: http://docs.citusdata.com/en/v8.1/develop/migration_mt_djang...
Prepare for the exciting future of DevOps transformation: it's `./manage.py makemigrations`.