| HN Mirror



	by ksajadi 1682 days ago
	Half the internet, including Spotify and Etsy, is down because of a global Google Cloud issue with their load balancers, and the GCP status page is all green: https://status.cloud.google.com If you ever wondered why GCP is a distant third in the enterprise cloud space, here is your answer.

5 comments

yuliyp 1682 days ago

Expecting an instant public post is a bit unrealistic. They had a post up just over 20 minutes after the incident start, which is not that crazy, given that they needed time to triage all of the alarms and understand which component was actually breaking and confirm some technically correct information around it, even if the actual internal incident response can run without the public post.

corobo 1682 days ago

It's Google though... did they stop automating everything?

At least make the screen not show all green or something automatically

yuliyp 1682 days ago

Which things should not be green?

If the automation is working the services will be up. When an incident is happening it's because something is significantly broken, and automation won't properly understand what is and is not working.

For instance, lots of follow-on alarms might be firing for what are not actually issues with the things being monitored: As an example, I would imagine that datacenter temperatures and fan speeds dropped due to the incident, which might cause automation to suspect a facilities issue, but announcing a facilities issue would be misleading.

Or metrics around instances live might be tanking as autoscaling groups start downsizing. This would not be an issue with the autoscaling service, and automatically announcing an autoscaling outage would again be misleading.

In an incident, taking the available data and reaching a conclusion about what is actually broken and what is merely a downstream effect requires skilled manual effort and is error-prone.
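
A minimal sketch of the failure mode described above, assuming a made-up alarm feed during a load balancer outage. The component and alarm names are hypothetical and not taken from any real GCP system. It only shows that a naive "any alarm firing means outage" rule would announce follow-on effects as outages in their own right.

```python
# Hypothetical alarms firing during a load balancer outage.
# Only "load_balancer" is the real cause; the rest are follow-on effects.
ALARMS = {
    "load_balancer": ["5xx_rate_high"],
    "facilities": ["fan_speed_drop", "temperature_drop"],
    "autoscaling": ["live_instances_drop"],
    "storage": [],
}


def naive_status(alarms):
    """Mark every component with at least one firing alarm as an outage."""
    return {
        component: "OUTAGE" if firing else "OK"
        for component, firing in alarms.items()
    }


status = naive_status(ALARMS)

# Components the naive rule would publicly announce as broken.
announced = sorted(name for name, state in status.items() if state == "OUTAGE")
print(announced)
```

Under these assumptions the rule flags facilities and autoscaling alongside the load balancer, which is the misleading announcement the comment warns about.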

corobo 1681 days ago

> Which things should not be green?

The broken ones is how I usually do it.

The automation doesn't need to do that, it doesn't need to analyse the situation. It needs to communicate "Hey. Our systems have seen this and have pinged humans, bear with" rather than "nope even though half the internet is down rn, it's all good baby"

Make a green tick a blue question mark or something. It doesn't even need to admit fault, it just needs to not be useless. My goal visiting the page is to get a link I can send clients: "Updates will be posted here". Nothing more.

Also, if you're hosting your monitoring system on the same system it's monitoring, you've completely missed the point. At least use a different region within your cloud provider; better would be a completely different provider. I'd even go as far as using different domains/TLDs to host the page if I was Google-sized.
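
A rough sketch of the "blue question mark" idea, assuming a status page that only knows two things: whether any alarm is firing and whether a human has confirmed an incident. The `Status` enum and `displayed_status` function are hypothetical names for illustration, not anything Google or another provider is known to run.

```python
from enum import Enum


class Status(Enum):
    OK = "green tick"
    INVESTIGATING = "blue question mark"
    CONFIRMED = "incident confirmed"


def displayed_status(alarm_firing: bool, human_confirmed: bool) -> Status:
    """Pick what the page shows without diagnosing anything.

    The automation never claims a cause. It only stops showing green
    once alarms have fired and humans have been paged.
    """
    if human_confirmed:
        return Status.CONFIRMED
    if alarm_firing:
        return Status.INVESTIGATING
    return Status.OK


# Alarms are firing but nobody has written the public post yet.
print(displayed_status(alarm_firing=True, human_confirmed=False).value)
```

The point of the design is that the intermediate state admits nothing beyond "we have seen something and people are looking", so it does not require the triage that a full incident post does.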

gowld 1682 days ago

Monitoring should be on a different system, unaffected by an outage in the monitored system.

yuliyp 1682 days ago

I think that's tangential to my point. The concerns in the post you replied to are about system interdependence making it hard for a monitoring system to separate cause and effect, even if that monitoring system is itself working properly.

yupper32 1682 days ago

The big three cloud platforms all have this issue of delayed status updates. Why do you think it's just GCP?

mostdataisnice 1682 days ago

...and there's a reason for that. Automations to update the status page are rarely acceptable, since the status page statuses have legal and financial implications. Therefore, the IM usually has to update it (or tell someone to update it). But, realistically, when you get paged, you first need to figure out what exactly is wrong and at least a vague idea of why. Then, you need to tell someone to update the page. Then, it gets updated.

The status page will always lag the outage. It's not a conspiracy.

deathanatos 1682 days ago

Status pages should be driven that way, though. "legal and financial" implications and "It's not a conspiracy" are poor excuses.

Now, I'm on Azure, but from the comments it seems like the situations are similar. So, instead of an automatically updated status page that would help engineers do their jobs, we get a status page that isn't accurate, and customers have to pull teeth to get a service credit where/when one is due. And it seems like you can have your cake and eat it too here: while IANAL, a footnote in the SLA or on the status page saying "this is a machine estimate and not reflective of what goes into the SLA" should do it, no?

gowld 1682 days ago

Not updating the status page, to avoid the legal and financial implications, is fraud -- taking money on false pretenses.

yuliyp 1681 days ago

fraud? how? what guarantees do they make about timeliness of status updates on their services?

kazen44 1682 days ago

Also, in most teams, people who do external communication are different from those doing triage and troubleshooting.

cheeze 1682 days ago

Yeah, but they are still people who are responding to a page, working on wording and getting it approved, and then updating.

20 minutes seems pretty reasonable to me.

brown9-2 1682 days ago

AWS typically has the same issue

deathanatos 1682 days ago

Azure has the same issues with updating their status page. Sometimes it never happens.

I might at least hold out some chance that Google Cloud will write an interesting postmortem, which is something Azure would never do IME.

iso1631 1682 days ago

I'm old enough to remember the old claim that the internet was designed to cope with a nuclear attack.