| HN Mirror



	by gkop 1651 days ago
	At the large platform company where I work, our policy is that if the customer reports the issue before our internal monitoring catches it, we have failed. Allow 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of impact, 10 minutes to craft the content and get it approved, and 5 minutes to execute the update; that adds up to 30 minutes end to end, with a healthy buffer at each step. 1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs looked like that day.
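For anyone who wants to check the arithmetic, here is a minimal sketch of that budget. The minute values and the 52-minute figure are the ones quoted in the comment; the step names and helper functions are hypothetical, made up purely for illustration.

```python
# Response-time budget from the comment above.
# Step names are hypothetical labels; only the minute values come from the comment.
BUDGET_MINUTES = {
    "alerting_lag": 5,
    "evaluate_impact": 10,
    "draft_and_approve": 10,
    "publish_update": 5,
}


def total_budget(budget):
    """Sum the per-step allowances to get the end-to-end target in minutes."""
    return sum(budget.values())


def overrun(actual_minutes, budget):
    """Minutes by which an actual response exceeded the budget (0 if within it)."""
    return max(0, actual_minutes - total_budget(budget))


if __name__ == "__main__":
    print("end-to-end budget:", total_budget(BUDGET_MINUTES), "minutes")
    print("overrun at 52 minutes:", overrun(52, BUDGET_MINUTES), "minutes")
```

Under these assumptions the 52 minutes reported in the article would be 22 minutes over a 30-minute target.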

1 comment

> our policy is if the customer reported the issue before our internal monitoring caught it

They discovered it right away; the problem was that the Service Health Dashboard was not updated. Source: link.

They don’t explicitly say "right away", do they? I skimmed it twice.

But yes, you’re right: there’s no reason to question their monitoring or alerting specifically.