| Mh, I work quite a bit on the ops side, and monitoring and observability have been part of my job for a while now. I'll say: effective observability, monitoring and alerting of complex systems is a really hard problem. Like, you look at a graph of a metric, and there are spikes. But... are the spikes even abnormal? Are the spikes caused by the layer below, because our storage array is failing? Are the spikes caused by... well, also the storage layer, but because the application is slamming the database with bullshit queries? Or maybe your data is collected incorrectly. Or you selected the wrong data, which is then summarized misleadingly. I've been in most of these situations. The monitoring means everything, and nothing, at the same time. And in the application case, little common industry wisdom will help you. Yes, your in-house code is slamming the database with crap, and thus all the layers in between are saturating and people are angry. I guess you'd add monitoring and instrumentation... while production is down. At that point, I think we're in a situation similar to "safety rules are written in blood": the most effective monitoring dashboards are found while prod is down. And that's just the road to finding the function in the code that's the problem. That's when product tells you how critical this is to a business-critical customer. |