
by donavanm 1373 days ago

Sorry for the delay, was traveling on holiday.

In short, yeah: using the incident status as the implied times makes sense for the bulk of cases. Totally agree on picking out signal from the user's inherent actions, while allowing them to provide more specific data when they know better.

Digging in a little further, I'm personally interested in moving past the incident data and inspecting the incoming alert(s) and related telemetry/metric/alarm data. For example, think of alarm definitions like “five 1m datapoints with a value above 0.1.” There's a good argument to count impact (and incident duration) from that first datapoint > 0.1. Then there's the delta from metric processing to alert to incident creation. On the back end there's frequently a delta between mitigating impact and actual incident resolution. Again, I think getting back to the source alarm/alert/metric data would get us a more accurate view of operations and customer impact.
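
To make the "count from the first datapoint" idea concrete, here is a rough Python sketch. Everything in it is hypothetical: the `Datapoint` type, the `find_impact_start` name, and the assumption that datapoints arrive sorted, one per minute, with no gaps. It is not any particular monitoring system's API, just the alarm rule from the example above (five consecutive 1m datapoints above 0.1) walked back to the first breaching datapoint.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Datapoint:
    """Hypothetical metric sample: one value per 1-minute period."""
    timestamp: datetime
    value: float


def find_impact_start(
    datapoints: Sequence[Datapoint],
    threshold: float = 0.1,
    required: int = 5,
) -> Optional[Tuple[datetime, datetime]]:
    """Return (impact_start, alarm_fired) for the first run of `required`
    consecutive datapoints strictly above `threshold`, or None if the
    alarm condition is never met.

    Assumes `datapoints` is sorted by timestamp with no missing periods.
    """
    run_start: Optional[datetime] = None
    run_len = 0
    for dp in datapoints:
        if dp.value > threshold:
            if run_len == 0:
                # First breaching datapoint of a new run: candidate impact start.
                run_start = dp.timestamp
            run_len += 1
            if run_len == required:
                # The alarm condition is satisfied at this datapoint.
                return run_start, dp.timestamp
        else:
            # Run broken: reset and wait for the next breach.
            run_len = 0
            run_start = None
    return None


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1, 12, 0)
    values = [0.02, 0.05, 0.3, 0.4, 0.2, 0.5, 0.6, 0.03]
    points = [Datapoint(t0 + timedelta(minutes=i), v) for i, v in enumerate(values)]
    result = find_impact_start(points)
    if result is not None:
        impact_start, alarm_fired = result
        # Detection lag that incident-status timestamps alone would hide.
        print(impact_start, alarm_fired, alarm_fired - impact_start)
```

With the sample values, the first breach is the 12:02 datapoint and the alarm condition is only met at 12:06, so impact counted from incident creation would already be at least four minutes short, before adding any alert-to-incident delay. The same walk could in principle be run in reverse at the tail end to find where the metric actually recovered versus when the incident was marked resolved.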