|
|
|
|
|
by brentjanderson
925 days ago
|
|
OP here: It’s true that all services have downtime for one reason or another. We discussed taking an outage window, but one thing that we kept coming back to was how we might trial run the upgrade with production data. Having a replica on PG 15 that was up to date with production data was invaluable for verifying our workloads worked as expected. Using a live replica makes it possible to trial run in production with minimal impact. A key learning for me from this migration was how nice it can be to track and mitigate all of the risks you can think of for a project like this. The risk of an in-place upgrade in the end seemed higher than the risks associated with the route we chose, outage windows notwithstanding. As a bonus, if we need this approach in the future, this blog post should give us a head start, saving us many weeks of work. We hope it helps other teams in similar situations do the same. |
|
1. You snapshot your RDS database (or use one of the existing ones I hope you have)
2. You restore that snapshot into a database running in parallel without live traffic.
3. You run the test upgrade there and check how long it takes.
4. You destroy the test database and announce a maintenance window for the same duration the test took + buffer.
I agree it's a good project to exercise some "migration" muscle, it just doesn't seem like the payoff is there when, like I mentioned above, AWS supports this out of the box from now on since you upgraded to a version compatible with their zero downtime native approach.
I think the only way this makes sense is if you do it for the blog post and use that to hire and for marketing, signaling your engineering practices and that you care about reliability.
By the way, I realize how I come across, and let me tell you I say this having myself done projects like this where looking back I think we did them more because they were cool than because they made sense. Live and learn.