by remus 1292 days ago
I understand the frustration, but I'm not convinced monitoring at large scale is that straightforward.

The core question is: what constitutes degraded service? Would you say a service is experiencing downtime every time a 500 response is served? If you're serving millions to billions of requests/sec, it seems a bit disproportionate to mark a service down after a single 500 error, so then you need to work out some kind of acceptable threshold.

What about latency? Again you're just going to draw a line in the sand somewhere.
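To make the "line in the sand" concrete, here's a rough sketch of what such a check might look like. The threshold values, the `WindowStats` type, and the `classify` function are all made up for illustration; picking real numbers is exactly the hard part.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregated metrics for one monitoring window (hypothetical shape)."""
    total_requests: int
    server_errors: int      # number of 5xx responses in the window
    p99_latency_ms: float


# Assumed thresholds, chosen arbitrarily for this sketch.
MAX_ERROR_RATE = 0.001        # 0.1% of requests may fail
MAX_P99_LATENCY_MS = 500.0    # p99 latency budget


def classify(window: WindowStats) -> str:
    """Return "ok", "degraded", or "unknown" for a single window."""
    if window.total_requests == 0:
        # No traffic means no signal either way.
        return "unknown"
    error_rate = window.server_errors / window.total_requests
    if error_rate > MAX_ERROR_RATE or window.p99_latency_ms > MAX_P99_LATENCY_MS:
        return "degraded"
    return "ok"
```

With these numbers a single 500 in a million requests stays "ok", while 5,000 errors in the same window flips to "degraded". Move either constant and you get a different answer, which is the whole problem.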

You end up with this big mix of metrics that define service quality, so you then have a kind of meta problem of deciding which metrics you should alert users on. Get too trigger happy and it's going to cost you money and customer trust, and your customers are going to get alert fatigue when it turns out the issue you alerted them about was more of a false alarm. Set the bar too high and you'll have angry customers wondering wtf is going on.

All that to say I don't think there's a right answer.

2 comments

We were pretty liberal with posting to our status page for years and thought it was The Right Thing to do. I still do, to a point.

But what ended up happening was that a competitor who didn't have a status page at all would use our status page against us in the sales process. They just never mentioned that they had no status page of their own to compare against.

This was the same competitor who went 100% down for ~4 days during the busiest month of the year and only posted updates to a private Facebook group. There was data loss that was never publicly admitted to.

So, yeah, we implemented reasonable boundaries on what constitutes a post to the status page. We also adopted a new status page provider that lets us categorize posts more granularly and allows users to subscribe to only the "urgent" channels that pertain to them.

Before 2003-ish, Amazon used to have a static "gonefishing" page on www.amazon.com that was manually triggered during outages. Because newspaper reporters were writing scripts to detect the GF pages, the pages were removed and the site was allowed to just spew 500s for whatever segment of critical pages was busted.
Very fair, but 45 minutes of outage/disruption before the public status is manually updated is poor service. Why is that acceptable for AWS to deliver to users?