Hacker News
IMVU: Detect failure with statistics (evanmiller.org)
49 points by sstone 5767 days ago
3 comments

So, here's what this is about.

One-sentence summary: If you're monitoring something to look for problems, you shouldn't treat each observation independently; multiple somewhat-low or somewhat-high observations may be a sign of trouble even if each on its own isn't enough to worry about. In more detail:

Suppose you have some number you're monitoring. It might be network latency, number of customer signups, temperature, fraction of your email that's spam, whatever. You would like to be notified if it starts behaving unexpectedly -- maybe your network is down, someone just trashed your company in the media, a fan has failed, or your spam filter has gone nuts.

There's a technique called Holt-Winters forecasting, which looks at historical data and assumes it's made up of something constant, something periodic (e.g., daily variation), and noise; it generates predictions, which include a measure of uncertainty as well as a predicted value. Then some guy called Brutlag developed a way to compare observations with Holt-Winters predictions from the past, and determine whether each new observation is suspect.
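To make the "constant plus periodic plus noise" idea concrete, here is a rough sketch in Python. It is a simplified seasonal exponential smoothing, not the actual Holt-Winters or Brutlag implementation: it leaves out the trend term, and the function name, the smoothing weights `alpha` and `gamma`, and the band width `delta` are all made up for illustration.

```python
def seasonal_forecast(series, period, alpha=0.5, gamma=0.5, delta=2.0):
    """Return (prediction, lower, upper) for each point after the first period.

    Simplified sketch: level + seasonal offset, no trend term.
    All parameter names and defaults here are illustrative assumptions.
    """
    first = series[:period]
    level = sum(first) / period               # the "something constant"
    seasonal = [y - level for y in first]     # the "something periodic"
    deviation = [0.0] * period                # smoothed absolute error per slot
    out = []
    for t in range(period, len(series)):
        i = t % period
        predicted = level + seasonal[i]
        band = delta * deviation[i]           # uncertainty around the prediction
        out.append((predicted, predicted - band, predicted + band))
        y = series[t]
        # Update each component with exponentially weighted averages.
        new_level = alpha * (y - seasonal[i]) + (1 - alpha) * level
        seasonal[i] = gamma * (y - new_level) + (1 - gamma) * seasonal[i]
        deviation[i] = gamma * abs(y - predicted) + (1 - gamma) * deviation[i]
        level = new_level
    return out
```

A per-observation check in the Brutlag style would then ask whether each new value falls outside its own `(lower, upper)` band, one value at a time.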

However, Brutlag's analysis basically treats each new measurement independently. So, e.g., suppose you have a number that's always non-negative (number of customer signups, say), and suppose the H-W prediction says that a value of 0 isn't too improbable. Then Brutlag's approach will not complain even if from some point onward every single measurement is 0 -- because each one, on its own, is reasonably plausible.

Evan Miller has a more sophisticated way of looking for anomalies. Each time a new observation comes in, he looks at the plausibility of that observation, just like Brutlag does; but he also tries adding up the last N observations and comparing them with expectations for the sum of N consecutive observations, for N=2,3,...T (for some suitably chosen limit T). So if you get, say, a lot of zeros, they may not be very implausible on their own, but getting five zeros in a row might be enough to trigger a warning.
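Here is a minimal sketch of the windowed idea, assuming the counts are Poisson so that the sum of N consecutive observations is Poisson with the sum of the N expected values. This is an illustration of the principle, not Miller's exact statistic; the function names, the expected rate of 1.5 per interval, and the 0.001 threshold are all invented for the example.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def low_run_alerts(observations, expected, max_window, threshold):
    """Return the window lengths N whose trailing sum is improbably low.

    observations: recent counts, oldest first (non-negative integers)
    expected:     predicted mean for each of those counts
    """
    alerts = []
    for n in range(1, min(max_window, len(observations)) + 1):
        total = sum(observations[-n:])
        lam = sum(expected[-n:])   # a sum of Poissons is Poisson
        if poisson_cdf(total, lam) < threshold:
            alerts.append(n)
    return alerts

# Hypothetical data: five zeros in a row where about 1.5 per interval is expected.
zeros = [0, 0, 0, 0, 0]
rates = [1.5] * 5
alerts = low_run_alerts(zeros, rates, max_window=5, threshold=0.001)
```

With these numbers a single zero has probability about 0.22, so the N=1 check stays quiet, while the N=5 window is the first one to drop below the threshold. One thing to keep in mind is that checking many window lengths at once raises the false-alarm rate, so the threshold probably needs to be stricter than for a single test.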

Miller gives an example where IMVU caught a network problem using this technique -- they were watching the number of customers who invited contacts to open an account -- which wouldn't have been caught by the Brutlag method, for exactly the reason given above: they had a run of quite-low measurements, but none of them on its own was low enough for the Brutlag method to complain, because Brutlag's lower confidence limit was zero.

Caveat: I only skimmed the paper.

The combination of two things set off alarm bells. First, the problem observed is that a continuous sequence of zero readings is erroneously treated as "ok". Second, the variation is modelled with a normal distribution (Section 3.1). Since the normal is symmetric, if the mean is sufficiently close to zero, readings below zero will fall within one standard deviation of it. But you can't ever have readings below zero in the type of systems under consideration. It seems this is the flaw in the original work, and furthermore this assumption is carried through to the fix (with some ad-hoc modifications [remember, I only skimmed the paper]). It seems to me that a cleaner model would drop this assumption of normality and use an asymmetric distribution (say, the Poisson) in its place. I would be interested in any comments from those who read the paper in more depth.
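A quick numeric sketch of the symmetry problem, with made-up numbers: take a mean of 3 and, purely for illustration, a standard deviation of sqrt(3) (the value a Poisson with that mean would have). The function names are hypothetical.

```python
import math

def normal_lower_limit(mean, sd, k=2.0):
    """Lower edge of a symmetric mean +/- k*sd band."""
    return mean - k * sd

def poisson_prob_zero(lam):
    """Exact P(X = 0) for X ~ Poisson(lam)."""
    return math.exp(-lam)

mean = 3.0
sd = math.sqrt(mean)                    # assumption made for this example only
lower = normal_lower_limit(mean, sd)    # negative: no count can ever fall below it
p_zero = poisson_prob_zero(mean)        # small but strictly positive
```

Here the symmetric band extends below zero, so no non-negative reading can ever breach its lower edge, whereas the Poisson assigns a zero reading a definite probability that can be compared against a threshold or multiplied across a run.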

This is precisely the approach they take. Section 3.4 ("The Model") begins, "The heart of the model is to treat incoming events as a Poisson process..."

This is somewhat tangential, but might help if other people are attempting to do similar things in the future...

A relation of mine is a pharmacist at a hospital. At his workplace, they have automated drug dispensing machines that can be used by employees to obtain medication for patients - saves a lot of time and work for dispensing normal stuff like painkillers, etc.

These machines use a statistical method to flag when an employee is pulling out more than the usual amount of medicine, as there are infrequent cases of employees selling/using it themselves.

The machines were programmed to use standard deviation for this - if what you draw is within 2 standard deviations of the mean of all users, you're fine.

The problem is that, on one occasion, an employee in a small section was a junkie and pulled out so much of a certain drug that the mean was skewed to the point that they were still within 2 std dev of it, and it wasn't noticed for a few months.
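The masking effect is easy to reproduce. The numbers below are invented, not from the hospital: five users, one of whom draws far more than the rest. The second function is a swapped-in alternative (median and median absolute deviation, with the conventional 0.6745 scaling and 3.5 cutoff), not something the machines are known to do.

```python
import statistics

def flag_by_sd(values, k=2.0):
    """Flag values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

def flag_by_mad(values, cutoff=3.5):
    """Flag values using the median and median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []   # degenerate case: more than half the values are identical
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]

# Hypothetical weekly draws for a five-person section.
draws = [10, 12, 9, 11, 200]
by_sd = flag_by_sd(draws)     # the outlier inflates both the mean and the sd
by_mad = flag_by_mad(draws)   # the median and MAD barely move
```

With only five values, no single value can sit more than (n-1)/sqrt(n), about 1.79, sample standard deviations from the mean, so a 2 standard deviation rule cannot fire in a group that small no matter how extreme one member is.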

So, to wrap up the story, you probably don't want to purely use statistics to test for failure. You want some basic sanity checks in there.

"All models are wrong, but some are useful." I think that in this case the model was just wrong... or a bit simplistic.