by callalex
885 days ago
Running voodoo analysis on graph spikes is indeed a fool’s errand. What you really need is load testing on every component of your system, and alerts for when you approach known, tested limits. Of course this is easier said than done and things will still be missed, but I’ve done both approaches, and only one of them had pagers needlessly waking me in the middle of the night often enough that I went on sleepless swearing rants to coworkers.
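
To make "alerts for when you approach known, tested limits" concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `KnownLimit` record, the component and metric names, the numbers, and the 80% warning fraction are all hypothetical, and a real setup would read its values from load-test results and a monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownLimit:
    """Hypothetical record of what a component survived in a load test."""
    component: str
    metric: str
    limit: float  # highest sustained value observed to be safe


def approaching_limits(limits, current, warn_fraction=0.8):
    """Return (component, metric, value, limit) for every reading that is
    at or above warn_fraction of its tested limit."""
    alerts = []
    for lim in limits:
        value = current.get((lim.component, lim.metric))
        if value is None:
            # No reading for this metric; skipped here, though a missing
            # metric is arguably worth its own alert.
            continue
        if value >= warn_fraction * lim.limit:
            alerts.append((lim.component, lim.metric, value, lim.limit))
    return alerts


# Made-up limits and readings, purely for illustration.
limits = [
    KnownLimit("db", "queries_per_second", 5000.0),
    KnownLimit("api", "requests_per_second", 1200.0),
]
current = {
    ("db", "queries_per_second"): 4200.0,
    ("api", "requests_per_second"): 600.0,
}

alerts = approaching_limits(limits, current)
for component, metric, value, limit in alerts:
    print(f"{component}: {metric} at {value:.0f}, tested limit {limit:.0f}")
```

The point of the design is that the threshold comes from a measured limit rather than from the shape of a graph, so an alert means "we are close to something we know breaks" instead of "this line looks unusual".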
For example, I had a disagreement with another engineer there during a larger outage. We eventually arrived at this: if we click that button in the application, the database dies a violent death. Their first reaction was: so we never click that button again. My reaction was: we put all our attention on the DB and click that button a couple of times.
If we can reliably trigger misbehavior of the system, we're back in control. If we're scared, we're not in control.