by code_research 3680 days ago
Now with extended network support, one question gets even more important: how do you do rollbacks with Ansible? There is no default mechanism or policy that seems to help with that, so do I have to hand-roll my rollbacks?
5 comments

The simple answer is: you don't, you "roll forward."

In the event you deploy some code, a DB migration, a server configuration change, etc, and your solution fails after the fact, you move forward, not backwards. Let me explain further.

If you deploy v1.0 of your application and it works, great! If you then deploy v1.1 and it falls over, you find out why, apply a fix, test it locally (Vagrant?), deploy it to a testing environment and perform automated tests (Selenium, JMeter, ...), and once it's working there, you deploy it to production. This is called a hotfix, and it will now be working as intended (unless something else is horribly off the mark, in which case you have other issues).

The key to this example is the local and remote/network-based testing environment(s). In my opinion, it's very much a realistic goal for ALL organisations of ALL sizes to operate local development environments using Vagrant and VirtualBox; a testing environment that spreads out the whole solution over multiple boxes (for testing networking code and configuration, among many other things); a staging environment for running performance tests (staging should match production bit-for-bit, CPU-for-CPU, RAM-for-RAM, ...) using JMeter or your choice of tooling; and finally a production environment to serve clients. This is the absolute minimum all organisations should be aiming for, and it doesn't even have to be fully automated using CI and/or CD.

Tests, such as unit tests, system tests, integration tests, usability and performance tests, and so on, are also critical to preventing the need to roll back and to implementing a roll-forward policy instead.

Curious if "roll forward only" could create situations where a failed version change leaves the system of interest in a non-functional state until the problem is diagnosed, the code revised, and an update released. If that's possible, I would have concerns about the infrastructure meeting the core needs of the business, such as providing value to customers.
Your system is already down. Rolling back is the same effort as rolling forward, if not more. At least that's what I've always found.

Another option is to have customers point at stage after it has been upgraded and if it all goes horribly wrong, a load balancer change should be enough to point people back at the older production environment.
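
A minimal sketch of that load balancer switch, in Python. The `LoadBalancer` class, the environment names, and the `healthy` callback are all hypothetical stand-ins for whatever your real load balancer exposes. The point is only that the old environment stays untouched until the new one proves itself.

```python
from typing import Callable


class LoadBalancer:
    """Toy model of a load balancer that sends all traffic to one environment."""

    def __init__(self, active: str) -> None:
        self.active = active

    def point_at(self, environment: str) -> None:
        self.active = environment


def cut_over(lb: LoadBalancer, new_env: str, healthy: Callable[[str], bool]) -> bool:
    """Point traffic at new_env; switch back to the old one if it is unhealthy.

    Returns True if the cutover stuck, False if traffic was pointed back.
    """
    previous = lb.active
    lb.point_at(new_env)
    if healthy(new_env):
        return True
    lb.point_at(previous)  # the old environment was never modified
    return False
```

Calling `cut_over(lb, "stage", healthy=...)` with a check that fails leaves `lb.active` on the previous environment, so the "rollback" is a single pointer change.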

All this being said, problems in production shouldn't be a thing with configuration management, infrastructure as code (Terraform), and tests, not to mention three environments (development, test, stage - at minimum) to work your way through before pushing to production.

> All this being said, problems in production shouldn't be a thing with configuration management, infrastructure as code (Terraform), and tests, not to mention three environments (development, test, stage - at minimum) to work your way through before pushing to production.

You'll still have problems, you've just automated them now. Those tools and approaches are great, but do they really prevent all production issues to the point where they "shouldn't be a thing"?

Keeping a system down while waiting for a hotfix is not an option for most operations. Rollbacks have their place and hotfixes have their place.
Taking a snapshot of a system before making a change and rolling back to that snapshot would be faster. In any case, a strong policy of only pushing changes to production that have been properly tested in staging will protect you the most.
Why not just roll back while you're testing the fix? No need to be suffering unnecessary downtime while you hunt, fix, test, package, stage, and deploy the hotfix. Rolling back takes you to a version that has already passed all stages.
John Wilkes of Google talks about this problem with Jeff Meyerson [1] and how it relates to the choice to use or not use containers. The spoiler is that container management tooling allows separation of infrastructure builds from deployment: a configuration problem when building a container happens on the build server instead of while a script is running on a machine in production. His argument is that when a container deployment to production fails, the state of the machine is readily known (new bad container) versus a more complex state when a scripted build fails partway to completion.

And a container management tool can facilitate handling a failed distribution automatically via rollback to a previously deployed working container.

[1] http://www.se-radio.net/2016/01/se-radio-show-246-john-wilke...
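
To make the "rollback to a previously deployed working container" idea concrete, here is a rough Python sketch. It does not model any real container manager. The `start` hook, the image names, and the `history` list are assumptions made up for illustration.

```python
from typing import Callable, List


def deploy_with_rollback(
    history: List[str],
    new_image: str,
    start: Callable[[str], bool],
) -> str:
    """Try to start new_image; on failure restart the last known-good image.

    history holds images that previously ran successfully, oldest first.
    start is a hypothetical hook that returns True if the container came up.
    Returns the image that ends up running.
    """
    if start(new_image):
        history.append(new_image)
        return new_image
    if not history:
        raise RuntimeError("deployment failed and there is no previous image")
    last_good = history[-1]
    if not start(last_good):
        raise RuntimeError(f"rollback to {last_good} also failed")
    return last_good
```

Because the image was built elsewhere, a failed `start` leaves only two possible states: the new image is running or it isn't. That is what makes the automatic fallback simple.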

Revert your playbooks and roles to the version of your last good deployment, and redeploy. With good version control, role version management and idempotent library modules, this should be functionally equivalent to a rollback.

There are plenty of caveats to the above (like the fact that the yum module won't downgrade [1], and you'll need reversible DB migrations) but that's basically the procedure.

[1] https://github.com/ansible/ansible-modules-core/issues/1419
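
As a sketch of that procedure, assuming your playbooks live in a git repository and each good deployment is tagged: the tag `deploy-v1.0`, the inventory name `production`, and the playbook `site.yml` used in the example call are hypothetical. The caveats above (package downgrades, DB migrations) still apply; this only covers the "check out and redeploy" part.

```python
import subprocess
from typing import List


def rollback_commands(last_good_ref: str, inventory: str, playbook: str) -> List[List[str]]:
    """Build the commands that redeploy the last good version of the playbooks."""
    return [
        ["git", "checkout", last_good_ref],
        ["ansible-playbook", "-i", inventory, playbook],
    ]


def run_rollback(last_good_ref: str, inventory: str, playbook: str) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in rollback_commands(last_good_ref, inventory, playbook):
        subprocess.run(command, check=True)
```

For example, `run_rollback("deploy-v1.0", "production", "site.yml")` would check out the tagged playbooks and run them against the production inventory.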

This isn't quite accurate. It won't uninstall or remove things that a previous version put into place, unless your playbooks/roles explicitly remove them before installing the new version.
Exactly. This whole "declare your environment" thing with Ansible doesn't work.

I've had thoroughly mixed experiences with Ansible. Yes, it's easy to get started, but it's certainly annoying having to create playbooks for removing stuff to get a clean state.

To be frank, my experience with configuration management has been a mix of "YES! THIS IS WHAT WE NEED!" and "...but it still doesn't adhere to immutable states." That's been true with Chef, Puppet, and Ansible, for me. I haven't experimented with other tools.
Depends how you write your playbooks/roles. You can write a role that will both add and remove depending on the value of a variable in your inventory. Then tweak the inventory and re-run.
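
Roughly what that looks like, sketched in Python rather than as a real role. Many Ansible modules accept a `state` parameter such as `present` or `absent`; the `plan` function and the package names below are hypothetical and only mimic that idea.

```python
from typing import Dict, List, Set, Tuple


def plan(desired: Dict[str, str], installed: Set[str]) -> List[Tuple[str, str]]:
    """Return the (action, package) steps needed to reach the desired state.

    desired maps a package name to "present" or "absent". Packages that are
    installed but not mentioned in desired are left alone, which is exactly
    why leftovers from an earlier release survive unless you list them.
    """
    steps = []
    for name, state in sorted(desired.items()):
        if state == "present" and name not in installed:
            steps.append(("install", name))
        elif state == "absent" and name in installed:
            steps.append(("remove", name))
        elif state not in ("present", "absent"):
            raise ValueError(f"unknown state for {name}: {state}")
    return steps
```

Flipping a package from "present" to "absent" in the inventory and re-running is the "tweak and re-run" step. Anything you never list stays on the box.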
IMO, rollbacks with Ansible (or Chef/Puppet etc.) are done by redeploying prior releases, not by trying to remove/replace software on an instance. The same goes for rolling out a server configuration update (like a certificate): if you need to roll it back, you send out the prior configuration.

Am I missing some other detail?

Hand-roll them like a fine Cuban cigar. Jokes aside, I would not inherently trust an automatic rollback even if Ansible did support it. You must always provision for worst-case failures.
Yes. There is no way to roll back.