by gkop
1651 days ago
At the large platform company where I work, our policy is that if the customer reports the issue before our internal monitoring catches it, we have failed. Allow 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of the impact, 10 minutes to craft the content and get it approved, and 5 minutes to execute the update; that adds up to 30 minutes end to end, with a healthy buffer at each step. 1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs looked like that day.
They discovered it right away; the Service Health Dashboard was not updated. Source: link.