by creer 978 days ago
Doesn't it depend on the likely reason for the outliers?

- A world with a different distribution than the one you are trying to fit

- A measurement environment subject to bad contacts or noise spikes or experimental mistakes

- A reporting system with occasional typos

- etc

Seems to me what to do with the outliers should be informed by some understanding of the environment. And in some cases, they should be noted and set aside while waiting to see if there is more data "out there" in the outliers' vicinity.

In some cases, a replacement for the outlier might be "nearby" while in other cases we know nothing about where the replacement should be.


I used to design PID controllers where some of the analog readings would be actually bad. And grossly so.

Like 1002, 998, 1004, 48723, 2104, 1003, 997...

Estimating the deviation from the mean of the last n readings and ignoring ones too far out works well. Also calculate the percentage of bad readings and have a way of displaying it.
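
A rough Python sketch of that idea, not the original controller code. The class name `OutlierGate`, the window size, the `k` multiplier, the warm-up length and the minimum tolerance are all made-up values for illustration. It keeps the last n accepted readings, rejects anything further than `k` standard deviations from their mean, and counts the rejects so the percentage can be displayed.

```python
from collections import deque
from statistics import mean, stdev


class OutlierGate:
    """Hypothetical helper: gate readings against the last n accepted ones."""

    def __init__(self, window=20, k=4.0, warmup=3, min_tol=1.0):
        self.good = deque(maxlen=window)  # last n accepted readings
        self.k = k                        # allowed distance, in standard deviations
        self.warmup = warmup              # readings accepted blindly at start (assumption)
        self.min_tol = min_tol            # floor so a flat signal does not reject everything
        self.total = 0
        self.bad = 0

    def accept(self, x):
        """Return True if x is kept, False if it is treated as a bad reading."""
        self.total += 1
        if len(self.good) >= self.warmup:
            centre = mean(self.good)
            tol = max(self.k * stdev(self.good), self.min_tol)
            if abs(x - centre) > tol:
                self.bad += 1
                return False
        self.good.append(x)
        return True

    @property
    def bad_percent(self):
        """Percentage of readings rejected so far."""
        return 100.0 * self.bad / self.total if self.total else 0.0


gate = OutlierGate()
readings = [1002, 998, 1004, 48723, 2104, 1003, 997]
kept = [r for r in readings if gate.accept(r)]
# kept == [1002, 998, 1004, 1003, 997], and gate.bad == 2
```

Two caveats with this sketch: the warm-up readings are trusted unconditionally, and because only accepted readings enter the window, a genuine step change in the signal would be rejected indefinitely. A real controller would presumably need some way to recover from that.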

Right! In a case like this, some outliers are just useless for calculating the mean - they come from something else - BUT they might still be useful in a SMART style of trying to detect degradation of the system. A second measure.
My take on these things is that you really need to understand what you are filtering for and what you are filtering against. If you don't, you'll have a bad time.