by pm90
1317 days ago
It should be viewed as a cat-and-mouse game. The general philosophy at these orgs is that the same failure should never happen _again_, so you build automation and safeguards that protect the system from the failure modes you know about. However, any complex system will also have failure modes you don't know about: new software, new features, new APIs, etc. keep going out and interact in complex ways, so the system will fail in very interesting ways. The general philosophy in operating these systems is therefore:
1. First get the system back into a state in which the problem is mitigated.
2. Apply some short-term hacks and roll back any suspicious recent changes.
3. Have someone dig deeper to find the root cause of the failure, discuss it with the impacted teams (often called a postmortem), and come up with long-term fixes that reduce or eliminate the chance of the root cause recurring.