by brudgers 3680 days ago
Curious if "roll forward only" could create situations where a failed version change could place the system of interest in a non-functional state until the problem was diagnosed, the code revised, and an update released. If that's possible, I would have concerns about the infrastructure meeting the core needs of the business, such as providing value to customers.

Your system is already down. Rolling back takes the same effort as rolling forward, if not more. At least that's what I've always found.

Another option is to have customers point at stage after it has been upgraded and if it all goes horribly wrong, a load balancer change should be enough to point people back at the older production environment.
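To make the load balancer idea concrete, here is a minimal sketch in Python. It is a toy model, not the API of any real load balancer: the `LoadBalancer` class, its pool names, and the version strings are all made up for illustration. The point is only that the switch and the switch back are each a single pointer change.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LoadBalancer:
    """Hypothetical stand-in for a load balancer with one live backend pool."""
    pools: Dict[str, str]  # pool name -> version deployed in that pool
    live: str              # name of the pool receiving customer traffic
    history: List[str] = field(default_factory=list)

    def point_at(self, pool: str) -> None:
        """Send traffic to `pool`, remembering where it went before."""
        if pool not in self.pools:
            raise KeyError(f"unknown pool: {pool}")
        self.history.append(self.live)
        self.live = pool

    def point_back(self) -> None:
        """Return traffic to the pool that was live before the last switch."""
        if not self.history:
            raise RuntimeError("no earlier pool to return to")
        self.live = self.history.pop()

    def serving_version(self) -> str:
        return self.pools[self.live]


# Stage has been upgraded to v2 while production still runs v1.
lb = LoadBalancer(pools={"production": "v1", "stage": "v2"}, live="production")
lb.point_at("stage")   # customers now hit the upgraded environment
lb.point_back()        # it went horribly wrong: back to the old environment
```

The old production environment is never modified in this scheme, which is what makes the switch back cheap.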

All this being said, problems in production shouldn't be a thing with configuration management, infrastructure as code (Terraform), and tests, not to mention three environments (development, test, stage - at minimum) to work your way through before pushing to production.

> All this being said, problems in production shouldn't be a thing with configuration management, infrastructure as code (Terraform), and tests, not to mention three environments (development, test, stage - at minimum) to work your way through before pushing to production.

You'll still have problems, you've just automated them now. Those tools and approaches are great, but do they really prevent all production issues to the point where they "shouldn't be a thing"?

Keeping a system down while waiting for a hotfix is not an option for most operations. Rollbacks have their place and hotfixes have their place.
Taking a snapshot of a system before making a change and rolling back to that snapshot would be faster. In any case, a strong policy of only pushing changes to production that have been properly tested in staging will protect you the most.
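As a rough illustration of snapshot-then-change, the sketch below models the system as a plain Python dict. That is an assumption made to keep the example self-contained; a real snapshot would be a VM image, volume snapshot, or database backup. `snapshot_guard` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
import copy
from contextlib import contextmanager


@contextmanager
def snapshot_guard(system: dict):
    """Snapshot `system` before a change and restore it if the change raises."""
    snapshot = copy.deepcopy(system)
    try:
        yield system
    except Exception:
        # Put the system back exactly as it was, then let the error surface.
        system.clear()
        system.update(snapshot)
        raise


system = {"version": "1.4", "schema": 7}
try:
    with snapshot_guard(system) as s:
        s["version"] = "1.5"
        s["schema"] = 8
        raise RuntimeError("migration failed part way through")
except RuntimeError:
    pass

# system is back to {"version": "1.4", "schema": 7}
```

The failure is still reported to the caller, so the snapshot only buys back a working system. It does not hide the problem that needs a fix.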
Why not just roll back while you're testing the fix? No need to be suffering unnecessary downtime while you hunt, fix, test, package, stage, and deploy the hotfix. Rolling back takes you to a version that has already passed all stages.
John Wilkes of Google talks about this problem with Jeff Meyerson [1] and how it relates to the choice to use or not use containers. The spoiler is that container management tooling allows separation of infrastructure builds from deployment: a configuration problem when building a container happens on the build server instead of while a script is running on a machine in production. His argument is that when a container deployment to production fails, the state of the machine is readily known (new bad container) versus a more complex state when a scripted build fails partway to completion.

And a container management tool can facilitate handling a failed distribution automatically via rollback to a previously deployed working container.
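A minimal sketch of that automatic rollback, in Python. This is not the behavior or API of any particular container manager: `deploy_with_rollback`, the image tags, and the `is_healthy` callback are hypothetical, and the health check stands in for whatever probe the tooling really runs.

```python
from typing import Callable, List


def deploy_with_rollback(
    deployed: List[str],
    new_image: str,
    is_healthy: Callable[[str], bool],
) -> str:
    """Try `new_image`; if it is unhealthy, keep the last working image.

    `deployed` is the history of images that ran successfully, newest last.
    Returns the image left running.
    """
    if is_healthy(new_image):
        deployed.append(new_image)
        return new_image
    if not deployed:
        raise RuntimeError("new image is unhealthy and there is no earlier image")
    # The machine state is simple: either the new container or the old one.
    return deployed[-1]


history = ["app:1.0"]
running = deploy_with_rollback(history, "app:1.1", lambda image: image != "app:1.1")
# running == "app:1.0" and history is unchanged, because app:1.1 failed its check
```

Because each image is built ahead of time, the only decision at deploy time is which already-built image to run.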

[1] http://www.se-radio.net/2016/01/se-radio-show-246-john-wilke...