by mjlawson 1385 days ago
I agree with this, mostly. I've worked at many a company where I've inherited the work of developers who built towards an expected view of the future. That expected future, of course, never quite lined up with the future that did come to pass. So a lot of extra work building out the wrong abstraction led to a lot more work to undo it.

My only disagreement is that I can't recall a time when premature monitoring was added and caused me a lot of pain. If anything, the opposite has been true. Can you explain what you mean there?

1 comment

It all comes down to cost. Most of the time you can get some kind of monitoring for free or at a very low cost. AWS gives you a bunch of metrics out of the box for every product. Wrap your webapp in newrelic-agent and you get a bunch of nice dashboards. But the more you want to monitor, the higher the costs get.

There are a lot of cases where you can catch something with monitoring, but that doesn't necessarily mean you should.

A recent one from my memory: in a SaaS product a team shipped a bug that went unnoticed for a few days. It was feature flagged, so it only affected a small fraction of customers and didn't trigger any global alerts. Now, since it didn't trigger alerts, the natural post-mortem action plan was "better monitoring". That would mean monitoring and alerting on "rate of errors by customer" (or "rate of errors by endpoint by customer", I don't remember).
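To make the "didn't trigger any global alerts" part concrete, here's a rough sketch with made-up numbers (the 100 customers, 100 requests each, and the 50% failure rate for the flagged customer are all assumptions for illustration, not figures from the incident). It only shows why a bug behind a feature flag can be invisible in a global error rate while being obvious in a per-customer one:

```python
from collections import defaultdict

def error_rates(events):
    """events: iterable of (customer_id, is_error) pairs.
    Returns (global_error_rate, {customer_id: error_rate})."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for customer, is_error in events:
        totals[customer] += 1
        if is_error:
            errors[customer] += 1
    total = sum(totals.values())
    global_rate = sum(errors.values()) / total if total else 0.0
    per_customer = {c: errors[c] / totals[c] for c in totals}
    return global_rate, per_customer

# Hypothetical traffic: 100 customers, 100 requests each.
# Only customer "c0" has the flag on, and half of its requests fail.
events = [
    (f"c{i}", i == 0 and r < 50)
    for i in range(100)
    for r in range(100)
]

global_rate, per_customer = error_rates(events)
print(f"global: {global_rate:.2%}, c0: {per_customer['c0']:.2%}")
```

The global rate barely moves while the one affected customer is failing half the time. Computing the per-customer number is the easy part; picking a threshold that works across thousands of customers is where it gets expensive.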

Given the usage pattern of the product, it was impossible to create a single global monitor like that; we'd have to manually configure it for each customer (and we had thousands of those). And even then, we'd inevitably be dealing with false positives every week.

The right action plan was to learn from the failure, but do nothing. We got extremely unlucky during an infrastructure update; shit happens. We don't need to build a complex monitoring system that catches one bug every 5 years.