Hacker News new | ask | show | jobs
by tetha 890 days ago
Mh, I work quite a bit in the OPs-side and monitoring and observability are part of my job, for a bit of time now too.

I'll say: Effective observability, monitoring and alerting of complex systems is a really hard problem.

Like, you look at a graph of a metric, and there are spikes. But... are the spikes even abnormal? Are the spikes caused by the layer below, because our storage array is failing? Are the spikes caused by ... well also the storage layer.. because the application is slamming the database with bullshit queries? Or maybe your data is collected incorrectly. Or you select the wrong data, which is then summarized misleadingly.

Been in most of these situations. The monitoring means everything, and nothing, at the same time.

And in the application case, little common industry wisdom will help you. Yes, your in-house code is slamming the database with crap, and thus all the layers in between are saturating and people are angry. I guess you'd add monitoring and instrumentation... while production is down.

At that point, I think we're at a similar point of "Safety rules are written in blood" - "the most effective monitoring boards are found while prod is down".

And that's just the road to find the function in code that's a problem. That's when product tells you how this is critical to a business critical customer.

1 comments

Running voodoo analysis on graph spikes is indeed a fool’s errand. What you really need is load testing on every component of your system, and alerts for when you approach known, tested limits. Of course this is easier said than done and things will still be missed, but I’ve done both approaches and only one of them had pagers needlessly waking me in the middle of the night enough to go on sleepless swearing rants to coworkers.
Yeh. Or, during a complex investigation, you need to setup a hypothesis explaining these spikes in order to eventually establish causality surrounding these spikes. And once you have causality, you can start fixing.

For example, I've had a disagreement with another engineer there during a larger outage. We eventually got to the idea: If we click that button in the application, the database dies a violent death. Their first reaction was: So we never click that button again. My reaction was: We put all attention on the DB and click that button a couple of times.

If we can reliably trigger misbehavior of the system, we're back in control. If we're scared, we're not in control.