by jacques_chester 2297 days ago
I think some sort of anomaly detection would be useful in your case. There are a bunch of libraries floating about; I remember at least Netflix[0], Yelp and Datadog talking about them. There appears to be a really good links page available too[1]. You can also learn a lot from Forecasting: Principles and Practice, which is free online[2].

I have previously pitched using a kind of SPC-for-metrics approach, with Nelson rules[3] to help surface metrics which are starting to move out of control. I think it would have the advantage over ML techniques that it's easy to understand.
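To make the SPC-for-metrics idea concrete, here is a minimal sketch of the first three Nelson rules applied to a metric series. The function name `nelson_violations` and the split into a `baseline` window (used to estimate the mean and standard deviation) and a `recent` window are assumptions for illustration, not part of any particular library.

```python
from statistics import mean, stdev

def nelson_violations(baseline, recent):
    """Return {rule_number: [indices into recent]} for Nelson rules 1 to 3.

    The control limits come from `baseline`, which is assumed to be a
    stretch of data from when the process was behaving normally.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    hits = {1: [], 2: [], 3: []}

    # Rule 1: a point is more than 3 standard deviations from the mean.
    for i, x in enumerate(recent):
        if abs(x - mu) > 3 * sigma:
            hits[1].append(i)

    # Rule 2: nine or more points in a row on the same side of the mean.
    run, prev_side = 0, 0
    for i, x in enumerate(recent):
        side = (x > mu) - (x < mu)  # +1 above, -1 below, 0 exactly on the mean
        if side == 0:
            run = 0
        elif side == prev_side:
            run += 1
        else:
            run = 1
        prev_side = side
        if run >= 9:
            hits[2].append(i)

    # Rule 3: six or more points in a row steadily increasing or decreasing.
    run, prev_dir = 1, 0
    for i in range(1, len(recent)):
        direction = (recent[i] > recent[i - 1]) - (recent[i] < recent[i - 1])
        if direction == 0:
            run = 1
        elif direction == prev_dir:
            run += 1
        else:
            run = 2
        prev_dir = direction
        if run >= 6:
            hits[3].append(i)

    return hits
```

Each rule is a few lines of counting, which is the point: an operator can read the rule that fired and see exactly why the metric was flagged.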

My experience is that alerting thresholds are a very poor mechanism for managing systems. They just ossify past disasters and typically become noise. Alert fatigue renders them meaningless. If they're set by the manufacturer then the incentives are broken: they will favour false alerts in order to push legal responsibility onto the operator.

[0] https://github.com/Netflix/Surus

[1] https://github.com/yzhao062/anomaly-detection-resources

[2] https://otexts.com/fpp2/

[3] https://en.wikipedia.org/wiki/Nelson_rules


Thanks for the links.

We only create an alert if there is a problem the operator can solve; otherwise there is no point in waking them up at 3 AM. If anything, our thresholds are set as loose as possible rather than as tight as possible.

However, there are many instances where the operator could be alerted earlier that the machine operation is abnormal. For example, the stator windings are rated for operation up to 155 degrees C, but the machine has been lightly loaded for a long time, the ambient temperature is normal, and the windings are at 140 degrees C. No alert would be generated from the stator winding temperature, but something is amiss.

I think this is a case where some ML/AI/hypeword techniques might be applicable: given half a dozen variables and a record of past operation, the controller would know what values to expect for the other variables.
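As a rough sketch of that idea, the simplest version is a regression: learn the expected winding temperature rise over ambient as a function of load from past operation, then flag readings that sit far from the prediction. Everything here is an assumption for illustration: the names `fit_rise_model` and `check_reading`, the straight-line model, the 3-sigma cutoff, and the `HISTORY` values, which are invented and not real machine data.

```python
from statistics import mean, stdev

# Invented history: (load fraction, ambient deg C, winding deg C).
HISTORY = [
    (0.2, 20.0, 52.0), (0.3, 25.0, 66.0), (0.5, 22.0, 82.0), (0.6, 30.0, 101.0),
    (0.8, 24.0, 114.0), (0.9, 28.0, 127.0), (1.0, 25.0, 136.0), (0.4, 18.0, 67.0),
]

def fit_rise_model(history):
    """Least-squares fit of (winding - ambient) against load.

    Returns (slope, intercept, residual_sd).
    """
    loads = [load for load, _, _ in history]
    rises = [winding - ambient for _, ambient, winding in history]
    m_load, m_rise = mean(loads), mean(rises)
    sxx = sum((x - m_load) ** 2 for x in loads)
    sxy = sum((x - m_load) * (y - m_rise) for x, y in zip(loads, rises))
    slope = sxy / sxx
    intercept = m_rise - slope * m_load
    residuals = [y - (intercept + slope * x) for x, y in zip(loads, rises)]
    return slope, intercept, stdev(residuals)

def check_reading(model, load, ambient, winding, k=3.0):
    """Return (expected winding temperature, True if the reading looks abnormal)."""
    slope, intercept, resid_sd = model
    expected = ambient + intercept + slope * load
    return expected, abs(winding - expected) > k * resid_sd

model = fit_rise_model(HISTORY)
# The lightly loaded machine with 140 degree windings from the example above.
expected, abnormal = check_reading(model, load=0.2, ambient=20.0, winding=140.0)
print(f"expected about {expected:.0f} C, abnormal: {abnormal}")
```

With this made-up history the 140 degree reading is flagged even though it is under the 155 degree rating, because it is far above what the model expects at that load. A real controller would likely need more inputs and a thermal lag term, since windings do not respond instantly to load changes.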

You should take a look at http://riemann.io

I agree with focusing on actionable alerts during on-call hours. You might be able to have some kind of scheduled change in sensitivity.

One thing I've wondered in the past year is whether fuzzy logic would be useful. Your example is a really good case of linguistic variables -- "lightly loaded", "a long time", "normal temperature" and so on. These can be assembled into rules or tables that should fire more sensibly than exact threshold values.
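Here is a minimal sketch of how those linguistic variables might be encoded, using piecewise-linear membership functions and min() as the fuzzy AND. The function names and every breakpoint (what counts as "lightly loaded", "a long time", "normal", or "hot") are assumptions picked for illustration; in practice they would come from the people who know the machine.

```python
def ramp_up(x, lo, hi):
    """Membership that rises linearly from 0 at lo to 1 at hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def ramp_down(x, lo, hi):
    """Membership that falls linearly from 1 at lo to 0 at hi."""
    return 1.0 - ramp_up(x, lo, hi)

def trapezoid(x, a, b, c, d):
    """Membership that is 1 between b and c, tapering to 0 at a and d."""
    return min(ramp_up(x, a, b), ramp_down(x, c, d))

def abnormality(load, hours_at_load, ambient_c, winding_c):
    """Degree (0 to 1) to which this rule fires:

    IF lightly loaded AND for a long time AND ambient is normal
    AND windings are hot THEN operation is abnormal.
    """
    lightly_loaded = ramp_down(load, 0.3, 0.5)             # load as a fraction of rating
    long_time = ramp_up(hours_at_load, 1.0, 4.0)            # hours at this load
    normal_ambient = trapezoid(ambient_c, 10.0, 15.0, 30.0, 35.0)
    hot_windings = ramp_up(winding_c, 100.0, 130.0)
    # Fuzzy AND taken as the minimum of the memberships.
    return min(lightly_loaded, long_time, normal_ambient, hot_windings)

print(abnormality(load=0.2, hours_at_load=6.0, ambient_c=22.0, winding_c=140.0))
print(abnormality(load=0.9, hours_at_load=6.0, ambient_c=22.0, winding_c=140.0))
```

The output is a degree rather than a yes/no, so the same rule could drive a low-priority notice at a small value and a page only when it is close to 1, instead of flipping state at a single exact threshold.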