| HN Mirror

Yeh. Or, during a complex investigation, you need to setup a hypothesis explaining these spikes in order to eventually establish causality surrounding these spikes. And once you have causality, you can start fixing.

For example, I've had a disagreement with another engineer there during a larger outage. We eventually got to the idea: If we click that button in the application, the database dies a violent death. Their first reaction was: So we never click that button again. My reaction was: We put all attention on the DB and click that button a couple of times.

If we can reliably trigger misbehavior of the system, we're back in control. If we're scared, we're not in control.