> We discussed taking an outage window, but one thing that we kept coming back to was how we might trial run the upgrade with production data

1. You snapshot your RDS database (or use one of the existing ones I hope you have).
2. You restore that snapshot into a database running in parallel without live traffic.
3. You run the test upgrade there and check how long it takes.
4. You destroy the test database and announce a maintenance window for the same duration the test took + buffer.

I agree it's a good project to exercise some "migration" muscle, it just doesn't seem like the payoff is there when, like I mentioned above, AWS supports this out of the box from now on since you upgraded to a version compatible with their zero downtime native approach. I think the only way this makes sense is if you do it for the blog post and use that to hire and for marketing, signaling your engineering practices and that you care about reliability.

By the way, I realize how I come across, and let me tell you I say this having myself done projects like this where looking back I think we did them more because they were cool than because they made sense. Live and learn.
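For what it's worth, the four steps above could be scripted roughly like this. It is only a sketch under assumptions: `rds` is expected to be a boto3 RDS client (for example `boto3.client("rds")`), the function name `trial_upgrade`, the identifiers, and the 50% buffer are made up for illustration, and a real restore would probably also need subnet group, security group, and parameter group settings that match production.

```python
import time


def trial_upgrade(rds, source_id, target_version, test_id="upgrade-trial",
                  poll_seconds=30, buffer_fraction=0.5):
    """Time a test engine upgrade on a throwaway copy of an RDS instance.

    `rds` is assumed to be a boto3 RDS client. Returns a tuple of
    (elapsed_seconds, suggested_window_seconds).
    """
    snapshot_id = f"{test_id}-snapshot"  # hypothetical naming scheme

    # 1. Snapshot the source database.
    rds.create_db_snapshot(DBInstanceIdentifier=source_id,
                           DBSnapshotIdentifier=snapshot_id)
    rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)

    # 2. Restore the snapshot into a parallel instance with no live traffic.
    rds.restore_db_instance_from_db_snapshot(DBInstanceIdentifier=test_id,
                                             DBSnapshotIdentifier=snapshot_id)
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=test_id)

    # 3. Run the upgrade on the copy and measure how long it takes.
    started = time.monotonic()
    rds.modify_db_instance(DBInstanceIdentifier=test_id,
                           EngineVersion=target_version,
                           AllowMajorVersionUpgrade=True,
                           ApplyImmediately=True)
    # Poll for the new version rather than only waiting for "available":
    # the status may still read "available" right after the modify call.
    # Assumes the reported EngineVersion matches target_version exactly.
    while True:
        db = rds.describe_db_instances(DBInstanceIdentifier=test_id)["DBInstances"][0]
        if db["DBInstanceStatus"] == "available" and db["EngineVersion"] == target_version:
            break
        time.sleep(poll_seconds)
    elapsed = time.monotonic() - started

    # 4. Destroy the test database (the snapshot itself is left in place).
    rds.delete_db_instance(DBInstanceIdentifier=test_id, SkipFinalSnapshot=True)

    # Maintenance window to announce: the measured duration plus a buffer.
    return elapsed, elapsed * (1 + buffer_fraction)
```

Passing the client in as a parameter keeps the sketch easy to dry-run against a stub before pointing it at a real account. The measured time only covers the engine upgrade on the copy, so anything else planned for the window would need to be added on top.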
I'm sympathetic to your comment that 15 minutes of planned downtime is fine for approximately 100% of SaaS companies. That's probably true here too, and maybe the work of doing this kind of upgrade was a waste in that regard. But in considering the kind of product experience we would want for ourselves, zero downtime seems better than some downtime. The opportunity cost of feature work over the same window is real, but so is the reputation we hope to build as a platform that "just works", even if the lengths we go to so that our customers don't have to think about it seem crazy.