by shoo 1214 days ago
I worked on a little web service in a much larger system that was deployed to the company's on-prem datacentres.

Deploying the entire system would take maybe a few hours: there were one to two dozen services, a few databases, and probably 20+ external integrations. Usually we'd only deploy the services or components that were changing. Deploying a single service would take only a few minutes, even with a human in the loop manually triggering the deploy scripts.

Deploying changes was decoupled from activating changes, to avoid outages due to deployments. There were two instances of the system running in production at all times, deployed to two datacentres. It was one giant monolithic blue-green deployment sitting behind a customer-facing load balancer. Suppose the "blue" prod system is currently the live instance of the system and "green" is the dark instance. You'd deploy your new release of your component to green, then once deployment was complete and seemed stable, someone would pull the big lever on the load balancer to start forwarding customer traffic to the green instance. For a while both prod instances would receive customer traffic, until all the timeouts for customer sessions being served by the blue instance kicked in, and they established new sessions with green. Then green would be live and blue would be fully dark. It'd usually take around 5 minutes or so to completely drain traffic from an instance.
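
A toy model of the switch-and-drain behaviour described above may help. It assumes each customer session is pinned to the instance that created it and has a fixed maximum lifetime. The class name `BlueGreenRouter`, the `session_lifetime` parameter, and the 300-second value are all hypothetical, chosen only to mirror the "around 5 minutes" drain time, and are not taken from the real system.

```python
class BlueGreenRouter:
    """Toy model of a load balancer in front of two production instances."""

    def __init__(self, live="blue", session_lifetime=300.0):
        self.live = live                          # instance that receives NEW sessions
        self.session_lifetime = session_lifetime  # assumed max session age, in seconds
        self.sessions = {}                        # session id -> (instance, started_at)

    def switch(self, target):
        """Pull the lever: new sessions go to `target`; existing ones drain."""
        self.live = target

    def route(self, session_id, now):
        """Return the instance that serves this request at time `now`."""
        entry = self.sessions.get(session_id)
        if entry is not None:
            instance, started_at = entry
            if now - started_at < self.session_lifetime:
                # Session has not timed out yet: stay on the instance that owns it.
                return instance
        # New or timed-out session: establish it on the current live instance.
        self.sessions[session_id] = (self.live, now)
        return self.live

    def active_on(self, instance, now):
        """Count sessions on `instance` that have not yet timed out."""
        return sum(
            1
            for inst, started_at in self.sessions.values()
            if inst == instance and now - started_at < self.session_lifetime
        )


router = BlueGreenRouter(live="blue", session_lifetime=300)
router.route("alice", now=0)     # "blue": established before the switch
router.switch("green")
router.route("bob", now=10)      # "green": new session after the switch
router.route("alice", now=100)   # "blue": still draining
router.route("alice", now=301)   # "green": old session timed out
```

In this model a rollback is just `router.switch("blue")`, and it takes up to one session lifetime to complete for the same reason the original switch does.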

If you saw error rates spike on some component and wanted to abort, then you'd need to jam the big metaphorical lever on the load balancer the other way to direct all the traffic back again. That might take 5 minutes or so, again governed by the client session timeouts designed into the system. Usually the technical speed wasn't the bottleneck; it was more that it'd take 15 to 60 minutes to get the business stakeholders into a room to decide whether they were willing to live with the errors or wanted to roll back to the old version.

In this context the real bottleneck wasn't deployment or activation time; they were both fine. The bottleneck was the pre-release test process in staging. There was a single staging environment for dozens of services owned and maintained by different teams, which would all be tested manually in lockstep. Changes had to be planned and coordinated weeks or months in advance to get a test window. Releases happened every four weeks or so. If your change wasn't stable in time to enter the big heavy testing phase in the integrated staging environment, you missed the boat and had to wait four weeks for another try.

1 comments

How do you handle database/data store upgrades? It seems there is a window where both the old and new system write into the same data store.

Great question. I don't know the details of how database schema changes were deployed; I only worked on the easy stuff: stateless web services.

Exactly as you say, there is a time window where both the old and new system write to the same data store. Both old and new systems, and the details of the deployment, need to be designed to tolerate this. Even if there is no change to the database schema, you need to think through what will happen if the old version of a component reads data written to the database by a newer version of that same component, or vice versa. Similar considerations apply if you need to roll back to the old version after the new version has run in production for a few hours, while the newly written data is still there. This can all be planned out and tested in staging.
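
One common way to tolerate mixed versions is to make readers forgiving. The sketch below is illustrative only and is not how the system above worked: it assumes records are stored as JSON, and the field names (`id`, `email`, `locale`) and function names are made up. The old version preserves fields it does not understand when it writes a record back, and the new version supplies a default for a field that old writers never set.

```python
import json

# Fields the OLD version of the component knows about (hypothetical).
KNOWN_FIELDS_V1 = {"id", "email"}


def read_v1(raw):
    """Old reader: use known fields, but keep unknown ones for write-back."""
    doc = json.loads(raw)
    known = {k: v for k, v in doc.items() if k in KNOWN_FIELDS_V1}
    extras = {k: v for k, v in doc.items() if k not in KNOWN_FIELDS_V1}
    return known, extras


def write_v1(known, extras):
    """Old writer: re-attach the unknown fields so newer data is not lost."""
    return json.dumps({**extras, **known}, sort_keys=True)


def read_v2(raw):
    """New reader: default the field that records from old writers lack."""
    doc = json.loads(raw)
    doc.setdefault("locale", "en")  # assumed default, purely illustrative
    return doc


# Old code edits a record that new code wrote: "locale" survives the round trip.
new_row = json.dumps({"id": 1, "email": "a@example.com", "locale": "fr"})
known, extras = read_v1(new_row)
known["email"] = "b@example.com"
round_tripped = write_v1(known, extras)

# New code reads a record that old code wrote: the missing field is defaulted.
old_row = json.dumps({"id": 2, "email": "c@example.com"})
user = read_v2(old_row)
```

The same checks cover the rollback case: after a rollback, the old version is simply reading records the new version left behind.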

I don't think this is unique to the blue / green deployment pattern. If you did a rolling deployment to upgrade app servers in a pool behind some customer-traffic facing load balancer, there would be a time window when both old and new versions of your app servers are all attached to your database. Same fundamental problem.

We intend to do something similar. The users get a pop-up in the front end when a new version is available and are asked to load the new version, but they can keep on using the old version for a while. The length of the time window does not really matter that much: whether it is 5 minutes or several hours, the issue is the same. Regarding the database upgrades, I believe that if you really want to, you can split a non-backwards-compatible upgrade into multiple upgrades that are each individually backwards compatible. But that is extra work, and like you said, you really have to think those things through for every change you roll out. So I think the extra effort is not always worth it, and I wonder how other people are approaching this.
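
For illustration, here is a rough sketch of splitting one breaking change into backwards-compatible steps, often called expand/contract. It assumes a hypothetical `users` table where a `name` column is being replaced by `full_name`, and uses an in-memory SQLite database purely so the example runs; the table, columns, and function names are all invented.

```python
import sqlite3


def expand(conn):
    # Step 1: purely additive change. Old code that never mentions the new
    # column keeps working, because the column is nullable.
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def write_user_transitional(conn, user_id, name):
    # Step 2: new code writes BOTH columns, so old readers still find `name`.
    conn.execute(
        "INSERT INTO users (id, name, full_name) VALUES (?, ?, ?)",
        (user_id, name, name),
    )


def read_user_transitional(conn, user_id):
    # New code prefers the new column and falls back to the old one for rows
    # that old code wrote before the backfill ran.
    row = conn.execute(
        "SELECT COALESCE(full_name, name) FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


def backfill(conn):
    # Step 3: copy values across for rows written by old code.
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


# Step 4 (contract) belongs in a LATER release, once no deployed version reads
# or writes `name` any more: only then is it safe to drop the old column.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
expand(conn)
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")  # old code, still valid
write_user_transitional(conn, 2, "Grace")                       # new code
backfill(conn)
print(read_user_transitional(conn, 1))  # Ada
print(read_user_transitional(conn, 2))  # Grace
```

Each step can be deployed and rolled back on its own, which is where the extra work comes from. The sketch ignores updates made to `name` by old code after the backfill; a real migration would need to handle those as well.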