|
I worked on a little web service in a much larger system that was deployed to the company's on-prem datacentres. Deploying the entire system would take maybe a few hours: there'd be one or two dozen services, a few databases, probably 20+ external integrations. Usually we'd only deploy the services or components that were changing. Deploying a single service would only take a few minutes, even with a human in the loop manually triggering the deploy scripts.
Deploying changes was decoupled from activating changes, to avoid outages due to deployments. There were two instances of the system running in production at all times, deployed to two datacentres. It was one giant monolithic blue-green deployment sitting behind a customer-facing load balancer. Suppose "blue" is currently the live instance of the system and "green" is the dark instance. You'd deploy the new release of your component to green, then once the deployment was complete and seemed stable, someone would pull the big lever on the load balancer to start forwarding customer traffic to green. For a while both prod instances would receive customer traffic, until the sessions being served by blue timed out and those customers established new sessions with green. Then green would be live and blue would be fully dark. It'd usually take around 5 minutes to completely drain traffic from an instance.
If you saw error rates spike on some component and wanted to abort, you'd need to jam the big metaphorical lever on the load balancer the other way to direct all the traffic back again. That might take another 5 minutes or so, again governed by the client session timeouts designed into the system. Usually the technical speed wasn't the bottleneck: it'd more typically take 15 to 60 minutes to get the business stakeholders into a room to decide whether they were willing to live with the errors or wanted to roll back to the old version.
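To make the switch-and-drain behaviour concrete, here's a rough toy model of it in Python. It's only a sketch under my own assumptions, not how the real load balancer worked: the `LoadBalancer` class, its method names, and the fixed 300-second session lifetime (measured from when the session started) are all made up for illustration.

```python
from dataclasses import dataclass, field

# Assumed value: roughly the 5-minute drain time mentioned above.
SESSION_LIFETIME_S = 300


@dataclass
class LoadBalancer:
    """Toy blue-green switch with session affinity (hypothetical model)."""

    live: str = "blue"  # instance that receives NEW sessions
    # session id -> (instance serving it, time the session started)
    sessions: dict = field(default_factory=dict)

    def switch_to(self, instance: str) -> None:
        """Pull the lever: from now on, new sessions go to `instance`."""
        self.live = instance

    def route(self, session_id: str, now: float) -> str:
        """Return the instance that serves this request."""
        entry = self.sessions.get(session_id)
        if entry is not None:
            instance, started = entry
            if now - started < SESSION_LIFETIME_S:
                # Session still valid: keep it pinned to its instance.
                return instance
        # No session yet, or it timed out: start one on the live instance.
        self.sessions[session_id] = (self.live, now)
        return self.live

    def active_on(self, instance: str, now: float) -> int:
        """Count unexpired sessions still pinned to `instance`."""
        return sum(
            1
            for inst, started in self.sessions.values()
            if inst == instance and now - started < SESSION_LIFETIME_S
        )


lb = LoadBalancer()
lb.route("alice", now=0)        # session starts on blue
lb.switch_to("green")           # activate the new release
lb.route("bob", now=10)         # new session lands on green
lb.route("alice", now=100)      # still pinned to blue while her session lives
lb.route("alice", now=300)      # session expired, re-established on green
```

A rollback in this model is just `lb.switch_to("blue")` again, and it drains on the same clock: sessions already on green stay there until they expire, which is why going back took about as long as going forward.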
In this context the real bottleneck wasn't deployment or activation time; both were fine. The bottleneck was the pre-release test process in staging. There was a single staging environment for dozens of services owned and maintained by different teams, which would all be tested manually in lockstep. Changes had to be planned and coordinated weeks or months in advance to get a test window. Releases happened every four weeks or so: if your change wasn't stable in time to enter the big heavy testing phase in the integrated staging environment, you missed the boat and had to wait four weeks for another try. |