It all comes down to cost. Most of the time you can get some kind of monitoring for free or at a very low cost. AWS gives you a bunch of metrics out of the box for every product. Wrap your webapp in newrelic-agent and you get a bunch of nice dashboards. But the more you want to monitor, the higher the costs are.

There are a lot of examples where you can catch something with monitoring, but that doesn't necessarily mean you should.

A recent one from memory: in a SaaS product, a team shipped a bug that went unnoticed for a few days. It was feature flagged, so it only affected a small fraction of customers and didn't trigger any global alerts. Since it didn't trigger alerts, the natural post-mortem action plan was "better monitoring". That would mean monitoring and alerting on "rate of errors by customer" (or "rate of errors by endpoint by customer", I don't remember).

Given the usage pattern of the product, it was impossible to create a global monitor like that. We'd have to manually configure it for each customer (and we had thousands of those). And even then, we'd inevitably be dealing with false positives every week.
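
To make the false-positive problem concrete, here's a rough sketch of what a naive "rate of errors by customer" check might look like. This is not what we actually ran; the function name, the event shape, the 5% threshold and the `min_requests` cutoff are all made up for illustration.

```python
from collections import Counter

def customers_over_threshold(events, threshold=0.05, min_requests=1):
    """Return customers whose error rate exceeds `threshold`.

    `events` is an iterable of (customer_id, is_error) pairs.
    All names and defaults here are hypothetical.
    """
    totals, errors = Counter(), Counter()
    for customer_id, is_error in events:
        totals[customer_id] += 1
        if is_error:
            errors[customer_id] += 1
    return sorted(
        c for c, n in totals.items()
        if n >= min_requests and errors[c] / n > threshold
    )

# One large customer at a 1% error rate, one tiny customer with a single failed request.
events = [("big", i % 100 == 0) for i in range(1000)]
events += [("tiny", True), ("tiny", False)]

print(customers_over_threshold(events))                   # ['tiny']
print(customers_over_threshold(events, min_requests=50))  # []
```

With one global threshold, the tiny customer pages you over a single failed request. Raise `min_requests` and the noise goes away, but so does any chance of catching a real bug that only hits small customers. Getting that tradeoff right is the part that ends up being tuned per customer.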

The right action plan was to learn from the failure, but do nothing. We got extremely unlucky during an infrastructure update, shit happens. We don't need to build a complex monitoring system that catches one bug every 5 years.