by falcolas 2561 days ago
Status pages are not driven by automation, they are driven by PMs. Amazon, Google... all of the big players work them this way.

And I'm not surprised, since actually reporting it as down has a lot of political blowback (not to mention contract blowback) within the company.

Also, being able to accurately report on an arbitrary source of downtime means you're smart enough to avoid that source of downtime in the first place.

Not that the cause can't be something obvious (deleting all the servers in a datacenter, or something similarly trivial but severe), but in general it's likely impossible to guarantee status page accuracy.

You could just ping the servers once a minute and tell if they're up or not. No need to know why they've gone down.
That only indicates the frontend of the service is up and potentially running. Being able to respond to a ping and being able to serve HTTP requests are two different things, and being able to serve HTTP requests and having a fully functional website are two different things again. Think about a wrong SSL certificate, a wrong domain mapping between frontend and backend, broken JS/CSS, etc.
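To make the layering concrete, here is a rough sketch of checking each layer in order and reporting the first one that fails. It is only an illustration: a TCP connect stands in for ping (real ICMP needs raw sockets), and the helper names, the host, and the marker string are all made up for the example.

```python
import socket
import urllib.request
from typing import Callable, List, Optional, Tuple

# A named check: (layer name, zero-argument function returning True on success).
Check = Tuple[str, Callable[[], bool]]


def tcp_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Layer 1: can we open a TCP connection at all? (stand-in for ping)"""
    with socket.create_connection((host, port), timeout=timeout):
        return True


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Layer 2: does the server answer with a 2xx status?
    For https URLs, urlopen also verifies the certificate by default."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return 200 <= resp.status < 300


def page_contains(url: str, marker: str, timeout: float = 5.0) -> bool:
    """Layer 3: does the body contain text we expect on a working page?"""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return marker in resp.read().decode("utf-8", errors="replace")


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in order; return the name of the first failing layer, or None."""
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            # Timeouts, TLS errors, HTTP 4xx/5xx, etc. all count as a failure.
            ok = False
        if not ok:
            return name
    return None


def build_checks(host: str, marker: str) -> List[Check]:
    """Hypothetical wiring of the three layers for one host."""
    url = f"https://{host}/"
    return [
        ("tcp", lambda: tcp_reachable(host)),
        ("http", lambda: http_ok(url)),
        ("content", lambda: page_contains(url, marker)),
    ]
```

Calling `first_failing_layer(build_checks("example.com", "Example Domain"))` would return `None` only if all three layers pass. Even then it says nothing about whether logged-in users can actually do anything, which is the point being made above.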
This outage is a great example. I can ping Google Calendar servers and I get an HTTP response. SSL also works like a charm.

And yet everybody agrees it's down.

Most outages aren't as obvious as this one, and any ping will fail intermittently (often because the ping agent itself has a failure). Google definitely has loads of pings hitting Google Calendar in various ways. Exposing this monitoring to the public is neither practical nor really useful. (It would also aid would-be attackers.)
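As a rough illustration of why raw ping results can't go straight onto a status page, here is a minimal sketch of debouncing a flaky probe: the reported state only flips after several consecutive results agree. The class name and the threshold of 3 are assumptions for the example, not anything Google is known to use.

```python
from collections import deque


class DebouncedStatus:
    """Report 'down' only after `threshold` consecutive failed probes,
    and 'up' again only after `threshold` consecutive successes."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.recent = deque(maxlen=threshold)  # last N probe results
        self.up = True  # assume healthy until proven otherwise

    def record(self, ok: bool) -> bool:
        """Feed in one probe result; return the (debounced) reported state."""
        self.recent.append(ok)
        if len(self.recent) == self.threshold:
            if not any(self.recent):
                self.up = False  # every recent probe failed
            elif all(self.recent):
                self.up = True  # every recent probe succeeded
            # mixed results: keep whatever we were reporting before
        return self.up
```

A single failed probe in a run of successes leaves the reported state at "up", which hides agent blips but also delays the report of a real outage by at least `threshold` probe intervals.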