by Silhouette
2529 days ago
It's not always as simple as that. What if the problem was that something in a change didn't behave as specified and wound up writing important data in an incorrect but retrievable format? Rolling back might not recognise that data properly and could end up either modifying it further so the true data could no longer be retrieved or causing data loss elsewhere as a consequence.
There are certainly changes that cannot be rolled back in a way that magically fixes the affected users, and I am not suggesting otherwise. In the context of mission-critical systems, though, mitigation is usually strongly preferred. For example, the Google SRE book says the following:
> Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct!
> Instead, your course of action should be to make the system work as well as it can under the circumstances. This may entail emergency options, such as diverting traffic from a broken cluster to others that are still working, dropping traffic wholesale to prevent a cascading failure, or disabling subsystems to lighten the load. Stopping the bleeding should be your first priority; you aren’t helping your users if the system dies while you’re root-causing. [...] The highest priority is to resolve the issue at hand quickly.
I have seen too many incidents (one in the last two days, in fact) that were prolonged because people refused to blindly roll back changes, merely because they thought those changes were not the root cause.