by brentjanderson 923 days ago
OP Here -

1. There was zero downtime: no dropped requests, no 5xx errors. There _was_ a latency spike, which we carefully tuned to stay within our customers' timeout limits, but we dropped zero requests during the cutover.

2. Yes, it's very tedious, and in its own way painful. We also did a MongoDB upgrade recently. We still took the time to verify our workloads on the more recent versions, but because Mongo is an AP system, it's trivial to fail over to the new version and move on.

That said, the application-level logic changes were not particularly complicated. The script to orchestrate the cutover was application-specific, and I think for migrations like this you have to do the work to get it done right.

I'd also add that the tedium of doing it right, while ideally avoidable, is precisely why customers pay us to handle this complexity on their behalf. Sometimes you've just got to do the work. They want a service that's up all the time. While no one can guarantee that, we strive for it within reason, and even then, going to "unreasonable" lengths to have a better customer experience is exactly what makes many products unreasonably good.

Stretching the work out and taking each step carefully did avoid critical mistakes. We had a few missteps along the way, and we were able to roll back without critically affecting the service. Doing an in-place upgrade to minimize the time spent on this problem would have been far riskier than spreading that risk out over the whole process we took. Of course, each team needs to figure out what's going to work for their situation & constraints.

3. We do use Aurora, but our instance was too old to be supported for zero-downtime patching (ZDP), which in any case does not handle major version upgrades. AWS also recently released blue/green deployments for Aurora Postgres clusters, which may be a way to do what we did without having to resort to as many changes.