by edmundsauto
829 days ago
Yes, especially with something that hasn’t been tested out. You might see a 3% increase in, e.g., latency. How do you know whether that is a telemetry bug, a temporary effect on actual user experience, or a permanent regression? Teams will prioritize real regressions, so there are three questions. 1, is this a real regression (if we don’t fix it we are in big trouble, b/c the UX is degraded)? 2, should we wait a day or two for more data? Or 3, do we need to debug telemetry and pipelines? You also need to provide guidelines for what is not normal variation (for cases 2 and 3 above). Is 3% OK? What is the effect on UX and/or business outcomes? And playbooks: “I am on call but don’t have deep expertise in this metric”. When and how do you escalate? What is the on-call responsibility, and which situations require immediate escalation vs. continuous monitoring? When do you declare a SEV? Observability tools are only as good as the process behind them.
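
To make the "what is not normal variation" guideline concrete, here is a rough sketch of one way it could be written down. Everything in it is an assumption for illustration: the `triage` function, the 3-sigma threshold, and the sample latency numbers are hypothetical, and a real guideline would tie the threshold to UX and business impact rather than to a fixed number.

```python
from statistics import mean, stdev


def triage(history, current, sigma=3.0):
    """Compare the current reading with recent history of the same metric.

    history: past values of the metric (e.g. daily median latency in ms).
    sigma:   hypothetical threshold, in standard deviations, above which
             the change is treated as outside normal variation.
    Returns (percent change vs. the historical mean, verdict string).
    """
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation; needs >= 2 points
    change_pct = (current - baseline) / baseline * 100

    if spread == 0:
        # Perfectly flat history: any change at all is unusual.
        z = 0.0 if current == baseline else float("inf")
    else:
        z = abs(current - baseline) / spread

    if z < sigma:
        verdict = "normal variation: keep monitoring"
    else:
        verdict = "outside normal variation: check telemetry, then escalate"
    return round(change_pct, 1), verdict


# Made-up latency history, purely illustrative.
latency_history = [100, 102, 98, 101, 99]
print(triage(latency_history, 103))
print(triage(latency_history, 110))
```

The point is not the statistics but that the on-call person gets a written rule and a next step, instead of having to guess whether 3% matters.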