

by creer 978 days ago

Doesn't it depend on the likely reason for the outliers?

- A world with a different distribution than the one you are trying to fit

- A measurement environment subject to bad contacts or noise spikes or experimental mistakes

- A reporting system with occasional typos

- etc

Seems to me that what to do with the outliers should be informed by some understanding of the environment. In some cases they should just be noted and set aside while waiting to see if more data turns up "out there" in the outliers' vicinity.

In some cases, a replacement for the outlier might be "nearby" while in other cases we know nothing about where the replacement should be.


Gibbon1 978 days ago

I used to design PID controllers where some of the analog readings would actually be bad, and grossly so.

Like 1002, 998, 1004, 48723, 2104, 1003, 997...

Estimating the deviation from the mean of the last n readings and ignoring ones too far out works well. Also calculate the percentage of bad readings and have a way of displaying it.
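
A rough sketch of that approach in Python, not the actual controller code. The class name `OutlierGate` and the parameters `n`, `k`, `min_spread` and `warmup` are made up for illustration, and the values are guesses you would tune per sensor. It compares each reading against the mean and standard deviation of the last n accepted readings, so a bad value never contaminates the reference window.

```python
from collections import deque
from statistics import mean, pstdev


class OutlierGate:
    """Ignore readings too far from the mean of the last n accepted readings."""

    def __init__(self, n=8, k=4.0, min_spread=5.0, warmup=3):
        self.window = deque(maxlen=n)  # last n accepted readings
        self.k = k                     # allowed distance, in units of spread
        self.min_spread = min_spread   # floor so a very quiet signal is not over-filtered
        self.warmup = warmup           # readings accepted blindly before filtering starts
        self.total = 0
        self.bad = 0

    def update(self, x):
        """Return x if accepted, or None if it is rejected as an outlier."""
        self.total += 1
        if len(self.window) >= self.warmup:
            centre = mean(self.window)
            spread = max(pstdev(self.window), self.min_spread)
            if abs(x - centre) > self.k * spread:
                self.bad += 1
                return None
        self.window.append(x)
        return x

    def bad_percent(self):
        """Percentage of all readings seen so far that were rejected."""
        return 100.0 * self.bad / self.total if self.total else 0.0


gate = OutlierGate()
readings = [1002, 998, 1004, 48723, 2104, 1003, 997]
accepted = [x for x in readings if gate.update(x) is not None]
print(accepted)                      # [1002, 998, 1004, 1003, 997]
print(round(gate.bad_percent(), 1))  # 28.6
```

Two caveats with this sketch: the first few readings are trusted blindly, and a genuine step change in the signal would be rejected indefinitely, so a real system would need some way to re-seed the window.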


creer 978 days ago

Right! In a case like this, some outliers are just useless for calculating the mean - they come from something else - BUT they might still be useful in a SMART style of trying to detect degradation of the system. A second measure.
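
As a sketch of what that second measure could look like: keep a rolling record of which readings were rejected and flag the system once the rejected fraction climbs past some limit. `BadRateMonitor`, `m` and `alarm_fraction` are hypothetical names and values, picked only to illustrate the idea.

```python
from collections import deque


class BadRateMonitor:
    """Track the fraction of rejected readings over the last m samples."""

    def __init__(self, m=100, alarm_fraction=0.10):
        self.flags = deque(maxlen=m)          # True means the reading was rejected
        self.alarm_fraction = alarm_fraction  # assumed threshold, tune per system

    def record(self, was_rejected):
        self.flags.append(bool(was_rejected))

    def bad_fraction(self):
        return sum(self.flags) / len(self.flags) if self.flags else 0.0

    def degraded(self):
        # Wait for a full window so a handful of early samples cannot trip the alarm.
        full = len(self.flags) == self.flags.maxlen
        return full and self.bad_fraction() > self.alarm_fraction
```

The point is that the rejected readings are dropped from the mean but still counted, so a slowly rising reject rate shows up before the sensor fails outright.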


Gibbon1 978 days ago

My take on these things is that you really need to understand what you are filtering for and what you are filtering against. If you don't, you'll have a bad time.
