by matryer 1371 days ago
Hey, thanks for your question. The tool keeps track of declaration and resolution times by watching when the status is changed. It also lets you manually specify when the incident really started, and when it actually ended. We can use this data to measure a few things, and watching how this changes over time helps us figure out if we're getting better or worse, on average. We want to be careful what we incentivise by default, and we're actively working on this area. The data is going to be available for people to build their own visualisations (in Grafana).
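To make that concrete, here is a minimal sketch of the idea, not the tool's actual code. The `Incident` class, its field names, and `mean_duration` are all hypothetical, made up for illustration: status changes supply the default timestamps, and manually specified times win when they are present.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Timestamps implied by status changes (hypothetical field names).
    declared_at: datetime
    resolved_at: datetime
    # Optional manual overrides for when the incident really started and ended.
    actual_start: Optional[datetime] = None
    actual_end: Optional[datetime] = None

    def duration(self) -> timedelta:
        # Prefer the manually specified times, fall back to the status changes.
        start = self.actual_start or self.declared_at
        end = self.actual_end or self.resolved_at
        return end - start


def mean_duration(incidents: list[Incident]) -> timedelta:
    # Average duration for a non-empty list of incidents.
    total = sum((i.duration() for i in incidents), timedelta())
    return total / len(incidents)
```

Computing `mean_duration` per week or per month would give the kind of trend line described above.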

I'd be very interested to hear your thoughts too.

1 comment

Sorry for the delay, was traveling on holiday.

In short, yeah, using the incident status changes as the implied times makes sense for the bulk of cases. Totally agree on picking out signal from the users' inherent actions, while allowing them to provide more specific data when they know better.

Digging in a little further, I'm personally interested in moving past the incident data and inspecting the incoming alert(s) and related telemetry/metric/alarm data. For example, think of an alarm definition like “five 1m datapoints with a value above 0.1.” There's a good argument for counting impact (and incident duration) from that first datapoint > 0.1. Then there's the delta from metric processing to alert to incident creation. On the back end there's frequently a delta between mitigating impact and actual incident resolution. Again, I think getting back to the source alarm/alert/metric data would give us a more accurate view of operations and customer impact.
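Here's a rough sketch of what I mean, using the example alarm definition above. The function name and the input shape (a time-ordered list of `(timestamp, value)` pairs at 1m intervals with no gaps) are my own assumptions, not any particular vendor's API. It returns both the first breaching datapoint and the datapoint at which the alarm would actually fire, so the gap between the two is the detection lag.

```python
from datetime import datetime
from typing import Optional

THRESHOLD = 0.1  # value from the example alarm definition
REQUIRED = 5     # consecutive 1m datapoints that must exceed the threshold


def evaluate_alarm(
    points: list[tuple[datetime, float]],
) -> Optional[tuple[datetime, datetime]]:
    """Return (impact_start, alarm_fired) for the first qualifying run, or None.

    Assumes points are sorted by time, one per minute, with no gaps.
    """
    run_start = None
    run_length = 0
    for ts, value in points:
        if value > THRESHOLD:
            if run_length == 0:
                run_start = ts  # first datapoint above the threshold
            run_length += 1
            if run_length == REQUIRED:
                return run_start, ts  # alarm fires on the fifth breach
        else:
            run_length = 0  # a healthy datapoint resets the run
    return None
```

With a breach that begins at 12:01, the alarm fires at 12:05, so counting from the alarm alone would under-report impact by four minutes, before any alert routing or incident creation delay is added on top.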