by gjm11 5767 days ago

So, here's what this is about.

One-sentence summary: If you're monitoring something to look for problems, you shouldn't treat each observation independently; multiple somewhat-low or somewhat-high observations may be a sign of trouble even if each on its own isn't enough to worry about. In more detail:

Suppose you have some number you're monitoring. It might be network latency, number of customer signups, temperature, fraction of your email that's spam, whatever. You would like to be notified if it starts behaving unexpectedly -- maybe your network is down, someone just trashed your company in the media, a fan has failed, or your spam filter has gone nuts.

There's a technique called Holt-Winters forecasting, which looks at historical data and assumes it's made up of a baseline level (possibly with a slow trend), something periodic (e.g., daily variation), and noise; it generates predictions, which include a measure of uncertainty as well as a predicted value. Then some guy called Brutlag developed a way to compare each new observation with the Holt-Winters prediction made for it from past data, and determine whether that observation is suspect.
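To make that concrete, here is a rough sketch of additive Holt-Winters one-step forecasting with a Brutlag-style band around each forecast. Assumptions that are mine and not from anything above: the function name `brutlag_flags`, the smoothing constants, the band being `delta` times a smoothed absolute error kept per phase of the cycle, and the `initial_dev` seed for that error. Treat it as an illustration of the idea, not as Brutlag's exact procedure.

```python
def brutlag_flags(series, period, alpha=0.3, beta=0.05, gamma=0.2,
                  delta=3.0, initial_dev=1.0):
    """Additive Holt-Winters forecasts with per-point confidence bands.

    The first `period` points seed the model. Returns one tuple
    (forecast, lower, upper, is_suspect) per later observation.
    """
    level = sum(series[:period]) / period            # baseline level
    trend = 0.0                                      # slow linear drift
    season = [y - level for y in series[:period]]    # offset per phase
    dev = [initial_dev] * period                     # smoothed |error| per phase
    out = []
    for t in range(period, len(series)):
        i = t % period
        forecast = level + trend + season[i]
        lower = forecast - delta * dev[i]
        upper = forecast + delta * dev[i]
        y = series[t]
        # Each point is judged on its own against its own band.
        out.append((forecast, lower, upper, y < lower or y > upper))
        # Exponential-smoothing updates for level, trend, season, deviation.
        prev_level = level
        level = alpha * (y - season[i]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[i] = gamma * (y - level) + (1 - gamma) * season[i]
        dev[i] = gamma * abs(y - forecast) + (1 - gamma) * dev[i]
    return out


# A clean 4-step cycle with one spike injected at index 17.
series = [10, 12, 14, 12] * 6
series[17] = 30
flags = [suspect for *_, suspect in brutlag_flags(series, period=4)]
```

Here `flags[13]` corresponds to the spike (index 17 minus the 4 seed points), and it is the first entry flagged.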

However, Brutlag's analysis basically treats each new measurement independently. So, e.g., suppose you have a number that's always non-negative (number of customer signups, say), and suppose the H-W prediction says that a value of 0 isn't too improbable. Then Brutlag's approach will not complain even if from some point onward every single measurement is 0 -- because each one, on its own, is reasonably plausible.
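A toy version of that blind spot, assuming (my simplification, not anything from Brutlag) that the forecast is summarised by a mean and a standard deviation, and that the band is the mean plus or minus `k` standard deviations with the lower edge clamped at zero. The names and numbers are made up for illustration.

```python
def lower_limit(mean, sd, k=3.0):
    """Lower confidence limit, clamped at zero for a non-negative metric."""
    return max(0.0, mean - k * sd)


def per_point_alarms(observations, mean, sd, k=3.0):
    """Check every observation on its own, ignoring its neighbours."""
    lo = lower_limit(mean, sd, k)
    hi = mean + k * sd
    return [x < lo or x > hi for x in observations]


# Forecast says "about 2 signups per interval, give or take 1.4".
# The lower limit clamps to 0, so a 0 is never below it.
alarms = per_point_alarms([0, 0, 0, 0, 0, 0], mean=2.0, sd=1.4)
```

Every entry of `alarms` is `False`: six zeros in a row, and the per-point check never fires.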

Evan Miller has a more sophisticated way of looking for anomalies. Each time a new observation comes in, he looks at the plausibility of that observation, just like Brutlag does; but he also tries adding up the last N observations and comparing them with expectations for the sum of N consecutive observations, for N = 2, 3, ..., T (for some suitably chosen limit T). So if you get, say, a lot of zeros, they may not be very implausible on their own, but getting five zeros in a row might be enough to trigger a warning.
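Here is a minimal sketch of the windowed check. It assumes something the description above doesn't spell out: that forecast errors are independent from one step to the next, so the sum of N observations has mean N times the per-step mean and standard deviation sqrt(N) times the per-step standard deviation. The function name, the constant per-step forecast, and the threshold `k` are hypothetical; Miller's actual method may well compute the expected range for the sum differently.

```python
import math


def smallest_suspicious_window(recent, mean, sd, k=3.0, max_window=10):
    """Return the smallest N whose trailing sum is implausible, else None.

    N = 1 is the ordinary per-observation check; N >= 2 are the sums of
    the last N observations, up to max_window (the limit T).
    """
    for n in range(1, min(max_window, len(recent)) + 1):
        total = sum(recent[-n:])
        expected = n * mean
        spread = k * math.sqrt(n) * sd   # grows like sqrt(N), not N
        if total < expected - spread or total > expected + spread:
            return n
    return None


# Same forecast as before: about 2 per interval, give or take 1.4.
quiet = smallest_suspicious_window([2, 3, 1, 2, 2, 3, 1, 2], mean=2.0, sd=1.4)
outage = smallest_suspicious_window([0] * 8, mean=2.0, sd=1.4)
```

With these made-up numbers `quiet` is `None` and `outage` is `5`: one to four zeros are each within range, but the expected sum grows faster than its uncertainty, so by the fifth zero the total is too low to be plausible.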

Miller gives an example where IMVU caught a network problem using this technique; they were watching the number of customers who invited contacts to open an account. The Brutlag method wouldn't have caught it, for exactly the reason given above: they had a run of quite-low measurements, but none of them on its own was low enough for the Brutlag method to complain, because Brutlag's lower confidence limit was zero.