Hacker News
by numlocked 3814 days ago
These are some cool SQL tricks! I like it.

The big caveat with the standard deviation technique is that it assumes a normal distribution. Many datasets are not normally distributed (power-law, Poisson, beta, etc.), so the technique can't be relied on there. It's a much harder problem to 'generically' detect outliers without knowledge of the underlying distribution.

I don't have any idea how to do it (though a former colleague came up with a nice idea: build a histogram and look for values that occur after some number of empty bins, implying an outlier). Is there an accepted state of the art for general-purpose outlier detection? Or is that such a broad question as to be meaningless?
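
For what it's worth, the histogram idea can be sketched in a few lines of Python. This is only a guess at what the colleague had in mind: the function name, the `bins` and `min_empty_bins` parameters, and the choice to scan only the upper tail are all assumptions made for illustration.

```python
def histogram_gap_outliers(values, bins=20, min_empty_bins=3):
    """Return values that sit beyond a run of >= min_empty_bins empty bins.

    Assumption: only the upper tail is scanned (left to right).
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return []  # all values identical, nothing to flag
    width = (hi - lo) / bins

    def bin_of(v):
        # The maximum value is clamped into the last bin.
        return min(int((v - lo) / width), bins - 1)

    counts = [0] * bins
    for v in values:
        counts[bin_of(v)] += 1

    empty_run = 0
    cutoff_bin = None
    for i, count in enumerate(counts):
        if count == 0:
            empty_run += 1
        elif empty_run >= min_empty_bins:
            cutoff_bin = i  # first occupied bin after a long gap
            break
        else:
            empty_run = 0

    if cutoff_bin is None:
        return []
    return [v for v in values if bin_of(v) >= cutoff_bin]


sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]
gap_outliers = histogram_gap_outliers(sample)
```

The result depends heavily on the bin count and the gap length, so both would need tuning per dataset.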

3 comments

Besides the normal-distribution assumption, there is another assumption that doesn't hold for most time series related to human activity, nature, or scheduling: constant variance. If heteroscedasticity is present, you cannot use the same standard deviation for the entire series. A more practical approach is to compute the variance for each calendar period separately.

Here's an example: the expected variance for the number of SWIFT payments processed during non-banking hours is 0, so any transaction count greater than 0 is an outlier.
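
A minimal Python sketch of the per-period approach, assuming the data arrives as `(period, value)` pairs where `period` is whatever calendar bucket makes sense (hour of day, weekday, and so on). The function name, the `threshold` parameter, and the sample numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_period_outliers(history, observations, threshold=3.0):
    """Flag observations that deviate from their own period's baseline.

    history, observations: lists of (period, value) pairs.
    """
    by_period = defaultdict(list)
    for period, value in history:
        by_period[period].append(value)

    # One mean and standard deviation per calendar period.
    stats = {p: (mean(vs), pstdev(vs)) for p, vs in by_period.items()}

    flagged = []
    for period, value in observations:
        if period not in stats:
            continue  # no baseline for this period
        mu, sigma = stats[period]
        # With sigma == 0, any deviation from the mean is flagged.
        if abs(value - mu) > threshold * sigma:
            flagged.append((period, value))
    return flagged


# Hypothetical payment counts: none at night, about 100 during the day.
history = [("night", 0)] * 5 + [("day", v) for v in (100, 110, 90, 105, 95)]
new_points = [("night", 1), ("day", 108), ("night", 0)]
period_outliers = per_period_outliers(history, new_points)
```

Here a single night-time payment is flagged because the night baseline has zero variance, while 108 daytime payments fall within the normal daytime spread.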

This is a very good point (to me as someone without any stats knowledge).

I think the article should begin with code that checks whether the values you're going to pass to the function are close enough to normally distributed for the outlier detection to be worthwhile. Is that possible?
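
One rough way to do such a check is the Jarque-Bera statistic, which measures how far the sample's skewness and kurtosis are from those of a normal distribution. The Python sketch below is not from the article. The function names are hypothetical, and the cutoff of 5.991 is the 5% critical value of a chi-squared distribution with 2 degrees of freedom, which is only an asymptotic approximation and is unreliable for small samples.

```python
import math
from statistics import NormalDist


def jarque_bera(values):
    """Jarque-Bera statistic: n/6 * (skew^2 + excess_kurtosis^2 / 4)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skew ** 2 + excess_kurtosis ** 2 / 4.0)


def looks_normal(values, critical=5.991):
    # Assumption: 5% level, chi-squared with 2 degrees of freedom.
    return jarque_bera(values) < critical


# Deterministic test data built from quantiles rather than random draws.
n = 500
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]  # exponential
```

A check like this can only reject normality; passing it does not prove the data is normal.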

Your colleague is describing something similar to kernel density estimation, which would be your first port of call Google-wise.
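
To give a feel for it, here is a small Gaussian kernel density sketch in plain Python that flags points sitting in low-density regions. The function names, the fixed `bandwidth`, and the relative threshold are assumptions chosen for illustration; a real analysis would pick the bandwidth more carefully.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function estimating the density at x with Gaussian kernels."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def low_density_points(values, bandwidth, rel_threshold=0.2):
    """Flag values whose density is below rel_threshold * the peak density."""
    density = gaussian_kde(values, bandwidth)
    densities = [density(v) for v in values]
    cutoff = rel_threshold * max(densities)
    return [v for v, d in zip(values, densities) if d < cutoff]


sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]
kde_outliers = low_density_points(sample, bandwidth=1.0)
```

Like the histogram approach, this makes no assumption about the shape of the distribution, but it replaces the choice of bin width with a choice of bandwidth.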