Hacker News new | ask | show | jobs
by vasco 925 days ago
> We discussed taking an outage window, but one thing that we kept coming back to was how we might trial run the upgrade with production data

1. You snapshot your RDS database (or use one of the existing ones I hope you have)

2. You restore that snapshot into a database running in parallel without live traffic.

3. You run the test upgrade there and check how long it takes.

4. You destroy the test database and announce a maintenance window for the same duration the test took + buffer.

I agree it's a good project to exercise some "migration" muscle, it just doesn't seem like the payoff is there when, like I mentioned above, AWS supports this out of the box from now on since you upgraded to a version compatible with their zero downtime native approach.

I think the only way this makes sense is if you do it for the blog post and use that to hire and for marketing, signaling your engineering practices and that you care about reliability.

By the way, I realize how I come across, and let me tell you I say this having myself done projects like this where looking back I think we did them more because they were cool than because they made sense. Live and learn.

1 comments

We actually did those steps as part of our overall assessment, and you're right that we could have taken an outage window for that long and called it a day. We decided the tradeoff wasn't worth it for our situation, but taking the outage window is definitely a viable option.

I'm sympathetic to your comment that 15 minutes of planned downtime is fine for approximately 100% of SaaS companies. That's probably true here too, and maybe the work of doing this kind of upgrade was a waste in that regard. But, in considering the kind of product experience we would want for ourselves, zero downtime seems better than no downtime. The opportunity cost of feature work over the same window is real, but so is the reputation we hope to build as a platform that "just works" even if it seems crazy the lengths we might go to so that our customers don't have to think about it.

> The opportunity cost of feature work over the same window is real, but so is the reputation we hope to build as a platform that "just works" even if it seems crazy the lengths we might go to so that our customers don't have to think about it

This part can definitely make sense, and if nothing else it can foster an engineering culture of "we care", which is great. I just wanted to show the other side but from your answers it seems like the team weighted the options. It's definitely a cool project to work on. Thanks a lot for engaging with a random grumpy guy on HN!

Random comment, but just wanted to say I really appreciate your blog post, but also I appreciate the informative and helpful discussion between you and vasco here. Feel like this could have easily devolved into defensiveness on either side, but instead I learned a lot from both of your responses - I feel like these kinds of interations are HN at its best. Thanks!