| HN Mirror


by edmundsauto 829 days ago

Yes, especially with something that hasn’t been tested out. You might see a 3% increase in, e.g., latency. How do you know whether that is a telemetry bug, a temporary effect on actual user experience, or a permanent regression?

Teams will prioritize real regressions, so you have to triage: 1, is it a real regression (if we don’t fix this, we’re in big trouble b/c the UX is degraded)? 2, should we wait a day or two for more data? Or 3, do we need to debug telemetry and pipelines?

You also need to provide guidelines for what counts as normal variation and what doesn’t (for cases 2 and 3 above). Is 3% OK? What is the effect on UX and/or business outcomes?
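As a rough illustration of what such a guideline could look like, here is a minimal Python sketch that compares a new reading against the metric's own historical noise. The function name `classify_change`, the sample numbers, and the z-score cutoff of 3 are all assumptions made up for this example, not anything from a real tool; each team would have to pick its own threshold.

```python
from statistics import mean, stdev

def classify_change(baseline, current, z_threshold=3.0):
    """Label a reading relative to the metric's historical variation.

    baseline: past readings of the metric (hypothetical daily values).
    current: the new reading to judge.
    z_threshold: assumed cutoff; how many standard deviations count as abnormal.
    Returns (label, percent change vs. baseline mean, z-score).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs >= 2 points
    pct_change = (current - mu) / mu * 100
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is unusual.
        z = 0.0 if current == mu else float("inf")
    else:
        z = abs(current - mu) / sigma
    label = "normal variation" if z < z_threshold else "investigate"
    return label, pct_change, z

# Hypothetical latency readings in milliseconds.
history = [100, 102, 98, 101, 99]
print(classify_change(history, 103))  # a 3% increase on a noisy metric
print(classify_change(history, 110))  # a 10% increase
```

With this made-up history, the 3% increase stays inside the usual noise while the 10% one does not. The same 3% on a much steadier metric would cross the cutoff, which is why the threshold has to be set per metric and tied to the UX or business impact.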

And playbooks - “I am on call but don’t have deep expertise in this metric”. When/how do you escalate? What is the on-call responsibility, and which situations require immediate escalation vs continuous monitoring? When do you declare a SEV?

Observability tools are only as good as the process behind them.

1 comments

topaztee 827 days ago

Those are some great points. I suppose those numbers will differ for each company (0.1 for Amazon is massive), so it would be hard to productize.
