

by rco8786 1197 days ago

I don't have any PR concern and I'm not being disingenuous; the reality is that this is an exceedingly hard task to actually pull off, with little or no payoff in the end.

For it to work, you have to know exactly which monitors could go into exactly which states (or ranges of values, etc.) such that, together, they constitute a user-facing outage that should be displayed publicly.

So you have to have automation around: How many users are affected? HOW are they affected? Is our homepage down but APIs still work? Is 1 API having issues but the other 100 just fine? Is there just some elevated latency but everything still works fine, or is it so elevated that we would call it an incident? Is it an internal tool that's broken? Is this a real alert or just an incidental alert due to a planned failover or a deploy and the metric loss is not indicative of user experience? Etc, etc.
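To make that concrete, here is a rough Python sketch of the decision the automation would have to make for a single alert. Everything in it is hypothetical: the `Alert` fields, the `Verdict` values, and all three thresholds are made up for illustration, not taken from any real monitoring system.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    IGNORE = "ignore"                    # nothing goes on the status page
    INTERNAL_ONLY = "internal_only"      # page someone, but keep it private
    PUBLIC_DEGRADED = "public_degraded"  # show "degraded performance"
    PUBLIC_OUTAGE = "public_outage"      # show "outage"


@dataclass
class Alert:
    # Hypothetical fields; real monitors rarely hand you these directly.
    component: str                 # e.g. "homepage", "api:billing"
    user_facing: bool              # False for internal tools
    affected_user_fraction: float  # 0.0 to 1.0
    p99_latency_ms: float
    error_rate: float              # 0.0 to 1.0
    during_planned_change: bool    # deploy or failover in progress


# Made-up thresholds. Choosing real ones per component is the hard part.
MIN_AFFECTED_FRACTION = 0.01
ERROR_RATE_OUTAGE = 0.5
LATENCY_INCIDENT_MS = 2000.0


def classify(alert: Alert) -> Verdict:
    """Decide whether one alert should show up on the public status page."""
    # Assumption: alerts during a planned deploy or failover are noise.
    # A deploy can also cause a real outage, so this is already wrong sometimes.
    if alert.during_planned_change:
        return Verdict.IGNORE
    if not alert.user_facing:
        return Verdict.INTERNAL_ONLY
    if alert.affected_user_fraction < MIN_AFFECTED_FRACTION:
        return Verdict.IGNORE
    if alert.error_rate >= ERROR_RATE_OUTAGE:
        return Verdict.PUBLIC_OUTAGE
    if alert.p99_latency_ms >= LATENCY_INCIDENT_MS:
        return Verdict.PUBLIC_DEGRADED
    return Verdict.IGNORE
```

Every branch hides a judgment call, and this only handles one alert at a time. It says nothing about how you would compute `affected_user_fraction` in the first place, or how to combine 100 APIs into a single status.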

It's truly an impossibly hard problem.