by brentjanderson 925 days ago
OP here: It’s true that all services have downtime for one reason or another. We discussed taking an outage window, but one thing that we kept coming back to was how we might trial run the upgrade with production data. Having a replica on PG 15 that was up to date with production data was invaluable for verifying our workloads worked as expected. Using a live replica makes it possible to trial run in production with minimal impact.

A key learning for me from this migration was how nice it can be to track and mitigate all of the risks you can think of for a project like this. The risk of an in-place upgrade in the end seemed higher than the risks associated with the route we chose, outage windows notwithstanding.

As a bonus, if we need this approach in the future, this blog post should give us a head start, saving us many weeks of work. We hope it helps other teams in similar situations do the same.


> We discussed taking an outage window, but one thing that we kept coming back to was how we might trial run the upgrade with production data

1. You snapshot your RDS database (or use one of the existing ones I hope you have)

2. You restore that snapshot into a database running in parallel without live traffic.

3. You run the test upgrade there and check how long it takes.

4. You destroy the test database and announce a maintenance window for the same duration the test took + buffer.
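
Steps 3 and 4 come down to timing the test upgrade and padding the result. Here is a rough Python sketch of that arithmetic. The 50% buffer, the rounding to 5-minute blocks, and the `run_upgrade` callable are illustrative assumptions, not something stated in this thread; the callable stands in for whatever actually performs the upgrade on the restored copy.

```python
import math
import time
from datetime import timedelta
from typing import Callable


def time_test_upgrade(run_upgrade: Callable[[], None]) -> timedelta:
    """Step 3: run the test upgrade on the restored copy and time it.

    `run_upgrade` is a hypothetical callable that blocks until the
    upgrade of the parallel (no live traffic) database has finished.
    """
    start = time.monotonic()
    run_upgrade()
    return timedelta(seconds=time.monotonic() - start)


def maintenance_window(
    test_duration: timedelta,
    buffer_fraction: float = 0.5,   # assumption: 50% safety margin
    round_to_minutes: int = 5,      # assumption: announce in 5-minute blocks
) -> timedelta:
    """Step 4: measured duration plus buffer, rounded up to whole blocks."""
    if test_duration <= timedelta(0):
        raise ValueError("test_duration must be positive")
    if buffer_fraction < 0 or round_to_minutes <= 0:
        raise ValueError("buffer_fraction must be >= 0 and round_to_minutes > 0")
    padded_minutes = test_duration.total_seconds() * (1 + buffer_fraction) / 60
    blocks = math.ceil(padded_minutes / round_to_minutes)
    return timedelta(minutes=blocks * round_to_minutes)


if __name__ == "__main__":
    # Example: pretend the test upgrade took 10 minutes.
    window = maintenance_window(timedelta(minutes=10))
    print(f"Announce a maintenance window of {window}")
```

Since the test copy has no live traffic, the measured time is only an estimate for production, which is the reason for the buffer.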

I agree it's a good project for exercising some "migration" muscle. It just doesn't seem like the payoff is there when, as I mentioned above, AWS supports this out of the box from now on, since you upgraded to a version compatible with their native zero-downtime approach.

I think the only way this makes sense is if you do it for the blog post and use that to hire and for marketing, signaling your engineering practices and that you care about reliability.

By the way, I realize how I come across, and let me tell you I say this having myself done projects like this where looking back I think we did them more because they were cool than because they made sense. Live and learn.

We actually did those steps as part of our overall assessment, and you're right that we could have taken an outage window for that long and called it a day. We decided the tradeoff wasn't worth it for our situation, but taking the outage window is definitely a viable option.

I'm sympathetic to your comment that 15 minutes of planned downtime is fine for approximately 100% of SaaS companies. That's probably true here too, and maybe the work of doing this kind of upgrade was a waste in that regard. But, in considering the kind of product experience we would want for ourselves, zero downtime seems better than some downtime. The opportunity cost of feature work over the same window is real, but so is the reputation we hope to build as a platform that "just works" even if it seems crazy the lengths we might go to so that our customers don't have to think about it.

> The opportunity cost of feature work over the same window is real, but so is the reputation we hope to build as a platform that "just works" even if it seems crazy the lengths we might go to so that our customers don't have to think about it

This part can definitely make sense, and if nothing else it can foster an engineering culture of "we care", which is great. I just wanted to show the other side, but from your answers it seems like the team weighed the options. It's definitely a cool project to work on. Thanks a lot for engaging with a random grumpy guy on HN!

Random comment, but just wanted to say I really appreciate your blog post, and I also appreciate the informative and helpful discussion between you and vasco here. Feel like this could have easily devolved into defensiveness on either side, but instead I learned a lot from both of your responses - I feel like these kinds of interactions are HN at its best. Thanks!