by topaztee 831 days ago
“is this a real world effect, if so, should we be worried?”

Can you expand on that? The way I'm translating that is that teams are spending lots of time checking whether something is broken or not.


Yes, especially with something that hasn’t been tested out. You might see a 3% increase in, e.g., latency. How do you know if that is a telemetry bug, a temporary effect in actual user experience, or a permanent regression?

Teams then have to prioritize. 1, is this a real regression (if we don’t fix this we’re in big trouble b/c the UX is degraded)? 2, should we wait a day or two for more data? Or 3, do we need to debug telemetry and pipelines?

You also need to provide guidelines for what is not normal variation (for cases 2 and 3 above). Is 3% OK? What is the effect on UX and/or business outcomes?
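To make the "what is not normal variation" guideline concrete, here is a rough sketch of how it could be written down as a rule. Everything in it is an assumption for illustration: the `triage` name, the z-score approach, and the `z_threshold` and `min_points` defaults are made up, and each team would have to pick its own numbers. It also cannot tell a telemetry bug from a real regression; that part stays with a human.

```python
from statistics import mean, stdev


def triage(baseline, recent, z_threshold=3.0, min_points=3):
    """Return 'normal', 'wait', or 'investigate' for a metric shift.

    baseline: historical values of the metric (at least two points).
    recent: the newest values, e.g. one per day since the change appeared.
    z_threshold and min_points are hypothetical defaults, not recommendations.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    shift = mean(recent) - mu

    if sigma == 0:
        # Flat baseline: any shift at all is outside normal variation.
        outside = shift != 0
    else:
        outside = abs(shift) / sigma >= z_threshold

    if not outside:
        return "normal"       # within the agreed band, no action
    if len(recent) < min_points:
        return "wait"         # too little data, collect another day or two
    return "investigate"      # sustained shift: real regression or telemetry bug


if __name__ == "__main__":
    latency_ms = [100, 101, 99, 100, 102, 98]   # made-up baseline
    print(triage(latency_ms, [103]))            # a single 3% bump
    print(triage(latency_ms, [110]))            # a single 10% bump
    print(triage(latency_ms, [110, 111, 109]))  # 10% sustained for three points
```

The point of writing it down is less the math and more that the on-call person gets the same answer every time without needing deep expertise in the metric.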

And playbooks: “I am on call but don’t have deep expertise in this metric”. When and how do you escalate? What is the on-call responsibility, and what situation requires immediate escalation vs continuous monitoring? When do you declare a SEV?

Observability tools are only as good as the process behind them.

Those are some great points. I suppose those numbers will differ for each company (0.1 for Amazon is massive), so it would be hard to productize.