|
|
|
|
|
by bdamm
2539 days ago
|
|
In my experience customers deeply detest the idea of waiting around for a failure case to re-occur so that you can understand it better. When your customers are losing millions of dollars in the minutes you're down, mitigation would be the thing, and analysis can wait. All that is needed is enough forensic data so that testing in earnest to reproduce the condition in the lab can begin. Then get the customers back to working order pronto. 20 minutes seems like a lifetime if in fact they were concerned that the degradation could happen again at any time. 20 minutes seems like just enough time to follow a checklist of actions on capturing environmental conditions, gather a huddle to make a decision, document the change, and execute on it. Commendable actually, if that's what happened. |
|
Bryan Cantrill has a great talk[0] about dealing with fires where he says something to the effect of:
> Now you will find out if you are more operations or development - developers will want to leave things be to gather data and understand, while operations will want to rollback and fix things as quickly as possible
[0] Debugging Under Fire: Keep your Head when Systems have Lost their Mind - Bryan Cantrill: https://www.youtube.com/watch?v=30jNsCVLpAE