| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by asleepawake 1548 days ago
	that’s awesome but when would it be useful? wouldn’t that lead to data loss?

2 comments

rorymalcolm 1548 days ago

If you drop a 'title' column from a users table, for example, you can revert and have the title column reappear with the dropped data, new users added during this time (while the column was dropped) will not have a title.

link

codingdave 1548 days ago

Just to clarify my understanding -

If we extend that scenario a bit to dropping a title column and at the same time adding a foo column. Then add rows with data in foo. Then revert. Do you lose the foo data?

Alternatively, can we separate those actions out? Drop columns in one migration. Add columns in another. Add rows and data. Then revert only the migration where columns were dropped, keeping the more recent adds?

link

shlomi-noach 1548 days ago

Right. So if you're both adding one column and removing another, then the revert will lose your new column and will regain your old column. Normally, you deploy DB and app in steps. E.g. if your migration adds a new column, your app is not yet aware of the column (or else your app would break). The moment the migration completes, your app is still on the not-knowing state. It takes an app deployment to actually start utilizing the new column. If you do that, and then want to revert -- you will lose any new data you've added to the new column.

In my experience, when a schema migration goes wrong, it goes wrong with a bang. It takes seconds to maybe one minute until pagers are alarming. So I'd say in a common scenario you will not get to deploy your app with the new column awareness, because you'll have realized the migration was bad right away.

> Alternatively, can we separate those actions out? Drop columns in one migration.

If you choose to do that, then you're on safer grounds; it costs you some wall clock time, because migrations do take a while to complete on medium to large tables.

Do note that Rewind only lets you rewind your most recent deployment (PlanetScale's app will not let you run the next migration before you've committed to, or have rewinded, the previous one).

link

shlomi-noach 1548 days ago

Engineer at PlanetScale; it will let you go back to safety without data loss, and without making your database inconsistent.

If you will indulge a realistic story; I've been through this process multiple times in production.

You change a large table via ALTER TABLE; you possibly change a data type, or drop a column, or modify an index. The change takes 5 hours to complete - and things go bad. Testing in staging was good, but as it turns out the production environment cannot cope with the changes and still needs the previous schema. Some traffic is still able to pass through, but some requests are erroring.

What do you do?

One option is to run another ALTER TABLE that takes you back into the original schema. This will take yet another 5 hours, during which your app may be degraded or altogether down. Plus you'll be unable to recover lost data (such as in a DROP COLUMN scenario). Another option is to do a point in time recovery for your entire database. This will both take time, but more importantly you will lose all the data you've accumulated since the migration completed. Any new user account, any new artifact, any new event - will be lost. Rows that were deleted suddenly reappear. Data that should not be available anymore suddenly is.

Most people will try a third option: do a point in time recovery on an offline server, and extract/copy just the specific table and copy it onto production. Typically this involves a lot of juggling and most environments will not have the infrastructure to automate the entire process. But even once this is done, you're still hit with the unfortunate implication: your data set is now both incomplete as well as inconsistent.

It is incomplete because data is missing from the restored table. Any rows accumulated since the point in time recovery point - are lost. It is inconsistent, because in many cases, due to the natural relational design of your schema, other tables will have rows that relate to the missing restored table's rows. You may try to then manually backfill those missing rows into the restored table (or remove rows previously deleted) , but in reality some processes will already have manipulated the data on the restored table even while you're trying to resolve the situation, leading to more conflicts.

It seems like the only safe way is to take everything offline, disable any writes to the broken tables as well as some of, or all tables, associated with it, resolve all conflicts, then restore data onto production and enable writes again. Or, you choose to lose data, track down any known conflicts, reach out to users and inform them of the data loss. Either way this has a significant impact on your service.

And so Rewind offers an instant fall back to your previous schema, while still retaining any data you've accumulated since the time of incident. Rewind resolves the differences between previous and current schema, and adapts the latest data changes onto the old schema. As you rewind the migration your table still has the same amount of rows, and maintains all incoming or outgoing references from and to other tables. It all happens on your production environment and does not require an offline server.

Here's a technical description of how Rewind works: https://planetscale.com/blog/behind-the-scenes-how-we-built-...

link

aboodman 1548 days ago

This is a really clear description of the problem and solution. Thanks!

I recommend amending the blog (or maybe making a new post) with this exact content. I regularly run SQL databases for smaller projects but was not able to immediately conceptualize how I would use this feature from just the blog post. Maybe your target audience would be able to? But I don't see the downside in just spelling it out clearly!

link

shlomi-noach 1547 days ago

Thank you, good idea!

link

edf13 1548 days ago

Sounds a bit iffy if your prod deployment fails after testing… surely the correct way is to have staging pickup any issues prior to prod

link

ledauphin 1548 days ago

lol.

I mean, yes... but also - have you really never seen a bug make it to production?

link

edf13 1548 days ago

Yes - but this kinda gives it an easy pass. Makes it easier to ‘move fast and break things’ which isn’t always best

link