Hacker News new | ask | show | jobs
by theptip 2686 days ago
This covers a lot of ground that I've recently had to learn the hard way.

One item I've been considering; under Downtime, a reason for flakes in migrations is "long running transactions".

I've seen this too, and wonder if the correct fix is actually to forbid long-running transactions. Typically if the naive long-running transaction does something like:

    with transaction.atomic():
        for user in User.objects.all():
            user.do_expensive_thing_to_related_objects()

You can often recast that migration to something more like

    for user in User.objects.all():
        with transaction.atomic():
            user = User.objects.get(id=user.id)  # Read the row to lock it; or do a SELECT FOR UPDATE
            user.do_expensive_thing_to_related_objects()
This example is somewhat trivial, but in most cases I've seen you can fetch your objects outside of the transaction, compute your expensive thing, and then lock your row for the individual item you're working on (with a sanity-check that your calculation inputs haven't changed, e.g. check the last_modified timestamp is the same, or better that the values you're using are the same).

I've considered simply configuring the DB connection with a very short connection timeout (something like 5 seconds) to prevent anyone from writing a query that performs badly enough to interfere with other tables' locks.

Anyone tried and failed/succeeded in making this approach work?

The other subject that's woefully underdeveloped is writing tests for migrations; ideally I want to (in staging) migrate the DB forwards, run all the e2es and smoke tests with the pre-migration application code, migrate back (to test the down-migration), run the e2es again, and then really migrate forwards again. That would cover the "subtly broken deleted field" migration problem.

But how do we test that our migrations behave correctly in the face of long-running transactions? I.e. what's the failing test case for that bug?

2 comments

> But how do we test that our migrations behave correctly in the face of long-running transactions? I.e. what's the failing test case for that bug?

Isn't it enough to simply make sure the migration transaction successfully finished? Even if there is a long running transaction, if the migration finished, that long tx will get aborted and rolled back.

Or if the migration stalls because the long running tx, then you'll presumably get a timeout error.

Is there something I'm missing?

No,

  with transaction.atomic():
    for user in db.users.all():
      user.do()
is not the same as

  for user in db.users.all():
    with transaction.atomic():
      user.do()
If the first fails, the entire data rolls back, if the second fails, half your db might be in an inconsistent state.
Yes, of course, they are not the same. I did not claim they were. I said:

> You can often recast that migration to something more like

By which I mean, in my experience you can usually write your migrations so that your code can work with the old AND the new version of the data, in which case you don't need to have a transaction around the whole operation.

This takes more work but is safer:

https://martinfowler.com/bliki/ParallelChange.html

https://www.amazon.com/Refactoring-Databases-Evolutionary-pa...