Hacker News
by ksajadi 1682 days ago
Half the internet, including Spotify and Etsy, is down because of a global Google Cloud issue on their load balancers, and the GCP status page is all green: https://status.cloud.google.com If you ever wondered why GCP is a distant third in the enterprise cloud space, here is your answer.
5 comments

Expecting an instant public post is a bit unrealistic. They had a post up just over 20 minutes after the incident started, which is not that crazy: they needed time to triage all of the alarms, understand which component was actually breaking, and confirm some technically correct information about it, even if the actual internal incident response can run without the public post.
It's Google though.. they stop automating everything?

At least make the screen not show all green or something automatically

Which things should not be green?

If the automation is working the services will be up. When an incident is happening it's because something is significantly broken, and automation won't properly understand what is and is not working.

For instance, lots of follow-on alarms might be firing for what are not actually issues with the things being monitored: As an example, I would imagine that datacenter temperatures and fan speeds dropped due to the incident, which might cause automation to suspect a facilities issue, but announcing a facilities issue would be misleading.

Or metrics around instances live might be tanking as autoscaling groups start downsizing. This would not be an issue with the autoscaling service, and automatically announcing an autoscaling outage would again be misleading.

In an incident, taking the available data and reaching a conclusion about what is actually broken and what is merely an effect requires skilled manual effort and is error-prone.

> Which things should not be green?

The broken ones is how I usually do it.

The automation doesn't need to do that, it doesn't need to analyse the situation. It needs to communicate "Hey. Our systems have seen this and have pinged humans, bear with" rather than "nope even though half the internet is down rn, it's all good baby"

Make a green tick a blue question mark or something. It doesn't even need to admit fault, it just needs to not be useless. My goal visiting the page is to get a link I can send clients: "Updates will be posted here". Nothing more.
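A minimal sketch of that idea in Python, to show how little the automation would have to decide. Everything here is hypothetical (the `Status` names, the `alert_threshold` parameter, the banner wording); it is not how any real status page works. The point is only that the page can stop showing green without diagnosing anything:

```python
from enum import Enum


class Status(Enum):
    OK = "green tick"
    INVESTIGATING = "blue question mark"
    CONFIRMED_OUTAGE = "red cross"


def displayed_status(alerts_firing: int,
                     human_confirmed_outage: bool,
                     alert_threshold: int = 1) -> Status:
    """Pick what the page shows. No root-cause analysis is attempted."""
    # A human has looked and confirmed a problem: show the outage.
    if human_confirmed_outage:
        return Status.CONFIRMED_OUTAGE
    # Alerts are firing but nobody has confirmed anything yet:
    # admit that something was seen, without saying what is at fault.
    if alerts_firing >= alert_threshold:
        return Status.INVESTIGATING
    return Status.OK


def banner(status: Status) -> str:
    """Text shown next to the icon (wording is an assumption)."""
    if status is Status.INVESTIGATING:
        return ("Our systems have seen something and have pinged humans. "
                "Updates will be posted here.")
    if status is Status.CONFIRMED_OUTAGE:
        return "We have confirmed an incident. Updates will be posted here."
    return "All systems operational."
```

Raising `alert_threshold` would be one way to keep a single flapping alarm from turning the page blue, at the cost of reacting later.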

Also if you're hosting your monitoring system on the same system it's monitoring you've just completely missed the point. At least use a different region within your cloud provider, better would be completely different provider. I'd even go as far as using different domains/TLDs to host the page if I was Google sized

Monitoring should be on a different system, unaffected by an outage in the monitored system.
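A rough sketch of what "on a different system" could look like, assuming a small Python prober that runs somewhere outside the monitored provider. The function names and the idea of passing in the probe are assumptions for illustration only; note that a prober like this reports reachability and does nothing to separate cause from effect:

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, HTTP error status, ...
        return False


def probe_all(targets: Iterable[str],
              probe: Callable[[str], bool] = http_probe) -> Dict[str, bool]:
    """Probe every target from this (external) vantage point."""
    return {url: probe(url) for url in targets}


def any_down(results: Dict[str, bool]) -> bool:
    """True if at least one target failed its probe."""
    return not all(results.values())
```

Passing `probe` in as a parameter keeps the reachability check swappable, so the same loop can be exercised with a fake probe instead of real network calls.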
I think that's tangential to my point. The concerns in the post you replied to are about system interdependence making it hard for a monitoring system to separate cause and effect, even if that monitoring system is itself working properly.
The big three cloud platforms all have this issue of delayed status updates. Why do you think it's just GCP?
...and there's a reason for that. Automations to update the status page are rarely acceptable, since the status page statuses have legal and financial implications. Therefore, the IM usually has to update it (or tell someone to update it). But, realistically, when you get paged, you first need to figure out what exactly is wrong and get at least a vague idea of why. Then, you need to tell someone to update the page. Then, it gets updated.

The status page will always lag the outage. It's not a conspiracy.

Status pages should be driven that way, though. "Legal and financial implications" and "It's not a conspiracy" are poor excuses.

Now, I'm on Azure, but from the comments it seems like the situations are similar. So, instead of an automatically updated status page that would help engineers do their jobs, we get a status page that isn't accurate, and customers have to pull teeth to get a service credit where/when one is due. And it seems like you can have your cake and eat it too here: while IANAL, a footnote in the SLA or on the status page saying "this is a machine estimate and not reflective of what goes into the SLA" should do it, no?

Not updating the status page, to avoid the legal and financial implications, is fraud -- taking money on false pretenses.
fraud? how? what guarantees do they make about timeliness of status updates on their services?
Also, in most teams, people who do external communication are different from those doing triage and troubleshooting.
Yeah, but they are still people who are responding to a page, working on wording and getting it approved, and then updating.

20 minutes seems pretty reasonable to me.

AWS typically has the same issue
Azure has the same issues with updating their status page. Sometimes it never happens.

I might at least hold out some chance that Google Cloud will write an interesting PM, which is something Azure would never do IME.

I'm old enough to remember the old claim that the internet was designed to cope with a nuclear attack.