by Alligaturtle 979 days ago
I'm not sure I would feel comfortable using the Winsorized mean -- it doesn't have any particular statistical properties, and it lacks intuitive appeal because it's not clear what the value represents.

I can understand a line of logic that would give rise to something like the Winsorized mean -- after you look at your data, you see some obvious outliers. It feels dirty to just drop those values (which would lead to the truncated mean) because an implausible value still carries information: the true value is more likely to be near the extreme than near the center of mass.
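
To make the difference between the two concrete, here is a minimal sketch of both estimators. The function names, the 10% default, and the sample data are my own assumptions for illustration, not anything from a particular library.

```python
from statistics import fmean

def winsorized_mean(values, fraction=0.1):
    """Replace the lowest/highest `fraction` of values with the nearest kept value, then average.

    Assumes 0 <= fraction < 0.5 and a non-empty input.
    """
    xs = sorted(values)
    k = int(len(xs) * fraction)       # observations replaced in each tail
    low, high = xs[k], xs[-k - 1]     # nearest values that are kept
    return fmean([min(max(x, low), high) for x in xs])

def truncated_mean(values, fraction=0.1):
    """Drop the lowest/highest `fraction` of values entirely, then average."""
    xs = sorted(values)
    k = int(len(xs) * fraction)
    return fmean(xs[k:len(xs) - k])

# Hypothetical sample with a long right tail.
data = [1, 2, 3, 4, 5, 6, 7, 8, 50, 100]
print(fmean(data))            # 18.6
print(winsorized_mean(data))  # 13.7   (1 -> 2 and 100 -> 50)
print(truncated_mean(data))   # 10.625 (1 and 100 dropped)
```

The Winsorized value sits between the plain mean and the truncated mean here because the clamped points still pull toward the tail they came from.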

What to do with those extreme values?

Here's something I now want to experiment with -- bootstrapping the extreme values. Take note of the original empirical distribution. Then, create a new distribution by removing the top and bottom X% of the observations and replacing them with values drawn i.i.d. from the original empirical distribution. This could lead to some values being replaced with the outliers that we originally wanted to drop. After we do this, record the mean. Then create new sample distributions until we have a distribution of new means. What I am curious about is how the shape of this distribution of means changes with the X% value selected at the beginning.
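
A rough sketch of that experiment, as I read the description. The function name, the number of rounds, the seeds, and the log-normal test sample (mu = 0, sigma = 1) are all assumptions of mine.

```python
import random
from statistics import fmean, stdev

def tail_resampled_means(values, fraction=0.1, n_rounds=1000, seed=0):
    """Drop the top and bottom `fraction` of values, refill those slots with
    i.i.d. draws from the ORIGINAL empirical distribution, and record the mean.
    Repeat `n_rounds` times and return the list of means."""
    rng = random.Random(seed)
    xs = sorted(values)
    k = int(len(xs) * fraction)          # slots removed from each tail
    core = xs[k:len(xs) - k]             # observations that are always kept
    means = []
    for _ in range(n_rounds):
        # Draws may land on the very outliers that were just removed.
        replacements = rng.choices(xs, k=2 * k)
        means.append(fmean(core + replacements))
    return means

# Hypothetical test sample: 200 log-normal draws.
rng = random.Random(42)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]
for fraction in (0.0, 0.05, 0.10, 0.25):
    means = tail_resampled_means(sample, fraction)
    print(fraction, round(fmean(means), 3), round(stdev(means), 3))
```

With `fraction=0.0` nothing is replaced, so every round returns the plain sample mean; that gives a baseline to compare the other X% values against.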

What are some well-known distributions that appear to have outliers? A log-normal distribution maybe?

6 comments

My surface-level take is they are similar to M-estimators. Whereas M-estimators are more mathematically rigorous, Winsorized metrics might be easier to compute manually.

It does feel like it's a very early 20th century technique. Nowadays we have so many tools which would be less feasible for calculators (the people) and more feasible for software.

https://en.m.wikipedia.org/wiki/M-estimator

I feel statisticians and econometricians tend to take the mean of the log of the distribution.

Recently, we started using the arcsinh instead of the log as well because the function has nice properties[1]

1. https://worthwhile.typepad.com/worthwhile_canadian_initi/201...

The reason why the arsinh transformation is useful (and this is not mentioned in the link you posted) is that it is the optimal variance-stabilizing transformation [1] under the assumption that your data is contaminated by a mixture of additive and multiplicative noise (the same way that the log transformation is the optimal variance-stabilizing transformation when your data is contaminated only by multiplicative noise).

Read the Wikipedia article for a more formal explanation.

[1] https://en.m.wikipedia.org/wiki/Variance-stabilizing_transfo...
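
For intuition only (this is not the variance-stabilization derivation), a small sketch of how arcsinh behaves at the two ends: roughly linear near zero, roughly log(2x) for large x, and defined at zero and for negative values where the log is not. The test points are arbitrary choices of mine.

```python
import math

# Compare asinh(x) with x (small-x behaviour) and log(2x) (large-x behaviour).
for x in (0.001, 0.1, 1.0, 10.0, 1000.0):
    print(x, math.asinh(x), x, math.log(2 * x))

# Unlike log, asinh handles zero and negative inputs, and it is an odd function.
print(math.asinh(0.0))                       # 0.0
print(math.asinh(-5.0) == -math.asinh(5.0))  # True
```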

Is taking logs (or arcsinh or whatever) really all that good an idea if (a) you don't have a good physical model justifying it or (b) your data spans several orders of magnitude?
Yes, in general.

It makes nonlinear relationships linear. Makes the model less sensitive, too. For instance if the data spans several OoM, adding or removing one datapoint in one of those orders can generate a lot of skew before the log-linearization.

It's easy to cast the log back to the original distribution by taking the exponent afterwards.
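
One thing worth keeping in mind when exponentiating afterwards: the back-transformed mean of the logs is the geometric mean, not the arithmetic mean. A tiny sketch with made-up data spanning several orders of magnitude:

```python
import math
from statistics import fmean

# Hypothetical data spanning four orders of magnitude.
data = [1.0, 10.0, 100.0, 1000.0]

mean_of_logs = fmean([math.log(x) for x in data])
back_transformed = math.exp(mean_of_logs)  # geometric mean, 10**1.5, about 31.62
arithmetic = fmean(data)                   # 277.75

print(back_transformed, arithmetic)
```

So the round trip is easy, but the number you get back answers a different question than the plain average does.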

As far as I understand, directly transforming your data can lead to problems. In any case, it's what link functions do better in generalized linear models[1].

[1] https://en.m.wikipedia.org/wiki/Generalized_linear_model

I think it makes a lot of sense. The best replacement for an outlier is the closest thing to the outlier that isn't an outlier. Resampling doesn't make a lot of sense because your new point is completely disconnected from your old one.

I don't really like the Winsorized mean but not for the reason you list. I think the main issue is that you are assuming exactly the top and bottom 10% are outliers instead of looking at the actual data distribution to see what the outliers are then using a similar replacement technique on only the outliers.

> What are some well-known distributions that appear to have outliers? A log-normal distribution maybe?

All of them. Which is why outside of a handful of contexts, the consensus in statistical modeling nowadays seems to be not to worry about outliers unless the values are completely unreasonable or there is a suspicious number of them.

As to your bootstrapping idea, why not bootstrap the entire distribution, why only the tails? If you only bootstrap the tails but allow draws from the entire empirical distribution then you are changing the underlying distribution.
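
For comparison, a minimal sketch of the ordinary bootstrap of the mean, where every observation is resampled rather than only the tails. The function names, the percentile interval, and the log-normal test sample are my own assumptions.

```python
import random
from statistics import fmean

def bootstrap_means(values, n_rounds=2000, seed=0):
    """Resample the whole sample with replacement and record each resample's mean."""
    rng = random.Random(seed)
    n = len(values)
    return [fmean(rng.choices(values, k=n)) for _ in range(n_rounds)]

def percentile_interval(means, coverage=0.95):
    """Simple percentile interval read off the sorted bootstrap means."""
    xs = sorted(means)
    tail = (1.0 - coverage) / 2.0
    lo = xs[int(tail * len(xs))]
    hi = xs[int((1.0 - tail) * len(xs)) - 1]
    return lo, hi

# Hypothetical test sample: 200 log-normal draws.
rng = random.Random(42)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]
means = bootstrap_means(sample)
print(fmean(sample), percentile_interval(means))
```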

It makes sense, and it is a better estimator of the expected value than the sample mean in some cases:

https://projecteuclid.org/journals/annals-of-statistics/volu...

The tradeoff is that while there are many estimators of the expected value that are more efficient (have less variance) than the sample mean for various distributions, they typically introduce bias.
Is "some cases" closer to 1% or 50+%? That paper is behind a paywall.

How does it do on normal and other common distributions?

https://arxiv.org/abs/1907.11391 it's pretty much the same thing
It seems obvious that for symmetrical distributions, like the normal, winsorizing does not change the mean.
Was looking for an answer for other common distributions, or a one-line summary of the paper.
but it can converge faster to the mean with the sample size...
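
If you'd rather check this than take anyone's word for it, here is a small simulation sketch comparing the spread of the sample mean and the 10% Winsorized mean over many repeated samples. The two test distributions (a standard normal, and a normal where 10% of draws come from a version ten times wider), the sample size, and the trial count are all assumptions I picked for illustration.

```python
import random
from statistics import fmean, pvariance

def winsorized_mean(values, fraction=0.1):
    """Clamp the lowest/highest `fraction` of values to the nearest kept value, then average."""
    xs = sorted(values)
    k = int(len(xs) * fraction)
    low, high = xs[k], xs[-k - 1]
    return fmean([min(max(x, low), high) for x in xs])

def estimator_variances(draw, n=50, trials=2000, seed=0):
    """Return (variance of sample mean, variance of Winsorized mean) across repeated samples."""
    rng = random.Random(seed)
    plain, wins = [], []
    for _ in range(trials):
        sample = [draw(rng) for _ in range(n)]
        plain.append(fmean(sample))
        wins.append(winsorized_mean(sample))
    return pvariance(plain), pvariance(wins)

def normal(rng):
    return rng.gauss(0.0, 1.0)

def contaminated(rng):
    # 10% of draws come from a much wider normal with the same center.
    return rng.gauss(0.0, 10.0) if rng.random() < 0.1 else rng.gauss(0.0, 1.0)

for name, draw in (("normal", normal), ("contaminated", contaminated)):
    print(name, estimator_variances(draw))
```

Both test distributions are symmetric around zero, so neither estimator is biased here; the only thing being compared is how noisy each one is. A skewed distribution would also need a bias check.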
Doesn't it depend on what is the likely reason for the outliers?

- A world with a different distribution than the one you are trying to fit

- A measurement environment subject to bad contacts or noise spikes or experimental mistakes

- A reporting system with occasional typos

- etc

Seems to me what to do with the outliers should be informed by some understanding of the environment. And in some cases, they should be noted and set aside while waiting to see if there is more data "out there" in the outliers' vicinity.

In some cases, a replacement for the outlier might be "nearby" while in other cases we know nothing about where the replacement should be.

I used to design PID controllers where some of the analog readings would be actually bad. And grossly so.

Like 1002, 998, 1004, 48723, 2104, 1003, 997...

Estimating the deviation from the mean of the last n readings and ignoring ones too far out works well. Also calculate the percentage of bad readings and have a way of displaying it.
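
A rough sketch of that kind of filter, run on the readings quoted above. The class name, window size, warm-up count, the 5-sigma threshold, and the minimum tolerance are all assumptions of mine, not the original controller's settings. Note that the first few readings are accepted unconditionally, so a bad reading during warm-up would poison the window.

```python
from collections import deque
from statistics import fmean, stdev

class ReadingFilter:
    """Reject readings that are too far from the mean of recently accepted readings."""

    def __init__(self, window=5, warmup=3, k=5.0, min_tolerance=1.0):
        self.recent = deque(maxlen=window)  # last accepted readings only
        self.warmup = warmup                # must be >= 2 so stdev is defined
        self.k = k
        self.min_tolerance = min_tolerance  # floor, in case recent readings are identical
        self.total = 0
        self.bad = 0

    def accept(self, reading):
        """Return True if the reading is kept, False if it is ignored as bad."""
        self.total += 1
        if len(self.recent) >= self.warmup:
            center = fmean(self.recent)
            tolerance = max(self.k * stdev(self.recent), self.min_tolerance)
            if abs(reading - center) > tolerance:
                self.bad += 1
                return False
        self.recent.append(reading)
        return True

    @property
    def bad_percent(self):
        return 100.0 * self.bad / self.total if self.total else 0.0

readings = [1002, 998, 1004, 48723, 2104, 1003, 997]
f = ReadingFilter()
kept = [r for r in readings if f.accept(r)]
print(kept)                     # [1002, 998, 1004, 1003, 997]
print(round(f.bad_percent, 1))  # 28.6
```

Rejected readings never enter the window, so one spike cannot widen the tolerance for the next one.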

Right! In a case like this, some outliers are just useless for calculating the mean - they come from something else - BUT they might still be useful in a SMART style of trying to detect degradation of the system. A second measure.
My take about these things is you really need to understand what you are filtering for and what you are filtering against. If you don't you'll have a bad time.
It is a very useful technique in robust statistics. I've personally used the Winsorized mean in robust optimization.