Hacker News new | ask | show | jobs
by callalex 885 days ago
Running voodoo analysis on graph spikes is indeed a fool’s errand. What you really need is load testing on every component of your system, and alerts for when you approach known, tested limits. Of course this is easier said than done and things will still be missed, but I’ve done both approaches and only one of them had pagers needlessly waking me in the middle of the night enough to go on sleepless swearing rants to coworkers.
1 comments

Yeh. Or, during a complex investigation, you need to setup a hypothesis explaining these spikes in order to eventually establish causality surrounding these spikes. And once you have causality, you can start fixing.

For example, I've had a disagreement with another engineer there during a larger outage. We eventually got to the idea: If we click that button in the application, the database dies a violent death. Their first reaction was: So we never click that button again. My reaction was: We put all attention on the DB and click that button a couple of times.

If we can reliably trigger misbehavior of the system, we're back in control. If we're scared, we're not in control.