| HN Mirror



	by yuliyp 1682 days ago
	Expecting an instant public post is a bit unrealistic. They had a post up just over 20 minutes after the incident started, which is not that crazy: they needed time to triage all of the alarms, understand which component was actually breaking, and confirm some technically correct information about it, even if the internal incident response can run without the public post.


corobo 1682 days ago

It's Google, though... did they stop automating everything?

At least make the screen not show all green or something automatically


yuliyp 1682 days ago

Which things should not be green?

If the automation is working the services will be up. When an incident is happening it's because something is significantly broken, and automation won't properly understand what is and is not working.

For instance, lots of follow-on alarms might be firing for what are not actually issues with the things being monitored. I would imagine that datacenter temperatures and fan speeds dropped due to the incident, which might cause automation to suspect a facilities issue, but announcing a facilities issue would be misleading.

Or metrics for the number of live instances might be tanking as autoscaling groups start downsizing. This would not be an issue with the autoscaling service, and automatically announcing an autoscaling outage would again be misleading.

In an incident, taking the available data and reaching a conclusion about what is broken and what is merely an effect requires skilled manual work and is error-prone.


corobo 1681 days ago

> Which things should not be green?

The broken ones is how I usually do it.

The automation doesn't need to do that, it doesn't need to analyse the situation. It needs to communicate "Hey. Our systems have seen this and have pinged humans, bear with" rather than "nope even though half the internet is down rn, it's all good baby"

Make a green tick a blue question mark or something. It doesn't even need to admit fault, it just needs to not be useless. My goal visiting the page is to get a link I can send clients: "Updates will be posted here". Nothing more.
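A minimal sketch of that idea, assuming the only signal available to the status page is whether on-call humans have been paged for a service. The names `Status`, `displayed_status`, `render` and the message text are hypothetical, made up for illustration; this is not how any real status page is known to work.

```python
from enum import Enum


class Status(Enum):
    OK = "green tick"
    INVESTIGATING = "blue question mark"


# Hypothetical wording, adapted from the comment above.
INVESTIGATING_NOTE = (
    "Our systems have seen this and have pinged humans. "
    "Updates will be posted here."
)


def displayed_status(humans_paged: bool) -> Status:
    """Choose the icon without analysing the incident.

    The only input is whether humans were paged; no attempt is made
    to work out which component is actually at fault.
    """
    return Status.INVESTIGATING if humans_paged else Status.OK


def render(service: str, humans_paged: bool) -> str:
    """Render one status-page line for a service."""
    status = displayed_status(humans_paged)
    line = f"{service}: {status.value}"
    if status is Status.INVESTIGATING:
        line += f" ({INVESTIGATING_NOTE})"
    return line
```

The point of the sketch is that the page never claims a root cause: it only stops asserting "all good" once a human has been paged.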

Also, if you're hosting your monitoring system on the same system it's monitoring, you've completely missed the point. At least use a different region within your cloud provider; better would be a completely different provider. I'd even go as far as using different domains/TLDs to host the page if I were Google-sized.
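The tiers in that paragraph can be written down as a small check. This is only a sketch: the `Hosting` record, its fields, and the tier labels are assumptions made for illustration, and real independence also depends on things this ignores (shared DNS, shared auth, shared network paths).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hosting:
    """Hypothetical description of where something is hosted."""
    provider: str
    region: str
    tld: str


def independence(monitored: Hosting, status_page: Hosting) -> str:
    """Classify how separated the status page is from what it monitors."""
    if monitored.provider != status_page.provider:
        # Best case in the comment: different provider, ideally a different TLD too.
        if monitored.tld != status_page.tld:
            return "different provider and TLD"
        return "different provider"
    if monitored.region != status_page.region:
        # The stated minimum: another region within the same provider.
        return "different region"
    return "same system"
```

For example, `independence(Hosting("a", "us-east", "com"), Hosting("a", "us-east", "com"))` returns `"same system"`, the case the comment warns against.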


gowld 1682 days ago

Monitoring should be on a different system, unaffected by an outage in the monitored system.


yuliyp 1682 days ago

I think that's tangential to my point. The concerns in the post you replied to are about system interdependence making it hard for a monitoring system to separate cause from effect, even if that monitoring system is itself working properly.
