Hacker News
by capableweb 1205 days ago
> To be successful this would require a very highly polished set of monitors, otherwise common situations like alert flapping would end up spamming updates. Having a human in the loop increases latency but can also increase the quality of the communication.

Preventing alert flapping is hardly "very highly polished", it's the base level for having even internal monitoring. And if it works for you internally, you could definitely make some automatic information available.

It's not rocket science we're talking about. Having a status page that can automatically say "Everything OK" or "Something seems to be not OK" would even be enough, but most companies don't offer something like that.
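
A minimal sketch of the kind of flap damping and two-state status page described above. The class name, the thresholds, and the idea of feeding it one boolean per health check are all assumptions made for illustration, not any particular monitoring product's API:

```python
from dataclasses import dataclass


@dataclass
class DampedCheck:
    """Hypothetical helper: flip state only after N consecutive contrary samples."""
    fail_threshold: int = 3  # assumed: consecutive failures before reporting not OK
    ok_threshold: int = 5    # assumed: consecutive successes before reporting OK again
    healthy: bool = True
    _streak: int = 0         # consecutive samples that disagree with the current state

    def observe(self, sample_ok: bool) -> bool:
        """Record one raw check result and return the damped state."""
        if sample_ok == self.healthy:
            # The sample agrees with the reported state, so any streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.ok_threshold if sample_ok else self.fail_threshold
        if self._streak >= needed:
            self.healthy = sample_ok
            self._streak = 0
        return self.healthy


def status_line(healthy: bool) -> str:
    """Render the two-state public message."""
    return "Everything OK" if healthy else "Something seems to be not OK"


check = DampedCheck()
for sample in [True, False, True, False, False, False]:
    state = check.observe(sample)
print(status_line(state))
```

With these assumed thresholds, a check that alternates between passing and failing never changes the published state; only a sustained run of failures does, and recovery needs a longer run of successes.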

2 comments

> Having a status page that can automatically say "Everything OK" or "Something seems to be not OK" would even be enough, but most companies don't offer something like that.

Any medium-to-large company will, at any given time, have dozens or hundreds of alerts going off. How do you know which ones should equal an "outage" on status.myco.com?

Edit: I just checked, and my company currently has 2079 monitors in an ALERT status. Some of those are probably real ongoing issues, maybe even some customer-facing issues... but there does not appear to be any reason to update our status page.
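
To make the selection problem concrete, here is a rough sketch that assumes every monitor carries a trustworthy `customer_facing` tag and a `severity` level. Both fields and the `Monitor` type are hypothetical, and keeping such tags accurate across thousands of monitors is exactly the hard part:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Monitor:
    name: str
    alerting: bool
    customer_facing: bool  # hypothetical tag; assumed to be maintained by hand
    severity: str          # hypothetical levels: "page" or "ticket"


def public_status(monitors: Iterable[Monitor]) -> str:
    """Report an outage only for alerting, customer-facing, page-level monitors."""
    relevant = [
        m for m in monitors
        if m.alerting and m.customer_facing and m.severity == "page"
    ]
    return "Something seems to be not OK" if relevant else "Everything OK"


# Plenty of internal monitors can be alerting without changing the public page.
internal = [Monitor(f"batch-job-{i}", True, False, "ticket") for i in range(100)]
print(public_status(internal))
print(public_status(internal + [Monitor("checkout-api", True, True, "page")]))
```

The filter itself is trivial; whether its output matches what customers actually experience depends entirely on the quality of the tags it assumes.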

Thanks for replying. Your assertion that this is not rocket science leads me to believe that your experiences do not overlap with mine. Alert flapping is a trivial example of a service health indicator being misaligned with reality. If the indicators you work with are always comprehensive, accurate, and precise, that’s wonderful, but it’s not a plausible reality for complex systems [1].

[1] https://how.complexsystems.fail/#5