by harperlee 1059 days ago
I’m not a statistician, but I’ve read somewhere the argument that the gaussian is the distribution that assumes the least about its data (just that there’s a mean and non-infinite variance) so it is typically safe to use when you know little about the real distribution.

(I’m just commenting to compel someone to correct me and expand on this subthread really :) )

3 comments

Good tactic!

> so it is typically safe to use when you know little about the real distribution.

That was what the quants doing risk assessment at the big banks thought pre-2008, which is the other context I associate with the n-sigma notation for probability.

So what OP is missing is the central limit theorem [1], according to which the sum of independent, identically distributed random variables with finite variance becomes, once suitably normalized, Gaussian distributed in the limit.

So, taking it apart: given some restrictions, a sum of many randomly distributed values is approximately Gaussian distributed.

If you take an average of some value, e.g. sea ice extent at a fixed date x, say January 1st every year, you have a sum distribution.

So you're not talking about the random distribution of any date, but of the random distribution of the average value. Only this has the Gaussian distribution.
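To make the point about averages concrete, here is a minimal sketch in plain Python. The choice of an exponential distribution, the sample sizes, and the helper names `sample_means` and `skewness` are my own assumptions for illustration; nothing here comes from the sea ice data. Single draws are strongly skewed, while averages of 100 draws are nearly symmetric, as the central limit theorem suggests.

```python
import random
import statistics

def sample_means(n_terms, n_samples, seed=0):
    """Return n_samples averages, each over n_terms i.i.d. exponential draws."""
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(1.0) for _ in range(n_terms)) / n_terms
        for _ in range(n_samples)
    ]

def skewness(xs):
    """Sample skewness; close to 0 for a symmetric shape such as a Gaussian."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

# Individual draws: far from Gaussian (the exponential has skewness 2).
skew_single = skewness(sample_means(1, 10_000))
# Averages of 100 draws: skewness shrinks roughly like 2 / sqrt(100).
skew_avg = skewness(sample_means(100, 10_000))

print(f"skewness of single draws:    {skew_single:.2f}")
print(f"skewness of 100-draw means:  {skew_avg:.2f}")
```

Skewness is only one crude check of "Gaussian-ness", but it is enough to see the individual values and their averages behave very differently.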

Now there are some restrictions: IID, i.e. independent and identically distributed. This is the part the quants got wrong. Identical distribution is usually not the issue; we can, for a certain timespan, assume that the random distribution stays roughly the same.

But independence was the issue. If one event is correlated with the next, the central limit theorem may hold for a bit, but if the correlation is too extreme it will break down, like in the quant models of yore.

Their estimates for the housing market risk were OK as long as the credit defaults were not highly correlated, but as soon as the crisis started, vicious cycles formed between foreclosures and tumbling house prices, causing more foreclosures.

The models broke down.
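A toy simulation shows how much correlation matters for the tail. This is only a sketch: the one-factor setup (each loan driven by one shared shock plus its own noise), the function name `tail_probability`, and every number in it are assumptions I picked for illustration, not a reconstruction of any bank's model.

```python
import random
from statistics import NormalDist

def tail_probability(rho, n_loans=200, p_default=0.05, cutoff=0.15,
                     n_trials=2000, seed=1):
    """Estimate P(more than `cutoff` of the loans default in one period).

    Each loan defaults when sqrt(rho)*Z + sqrt(1-rho)*e_i falls below a
    threshold, where Z is a shock shared by all loans and e_i is private
    noise. rho = 0 means fully independent defaults.
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(p_default)  # keeps P(default) = p_default
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    bad_periods = 0
    for _ in range(n_trials):
        z = rng.gauss(0.0, 1.0)  # shared shock for this period
        defaults = sum(
            a * z + b * rng.gauss(0.0, 1.0) < threshold
            for _ in range(n_loans)
        )
        if defaults > cutoff * n_loans:
            bad_periods += 1
    return bad_periods / n_trials

independent = tail_probability(rho=0.0)
correlated = tail_probability(rho=0.3)
print(f"P(>15% default), independent loans: {independent}")
print(f"P(>15% default), correlated loans:  {correlated}")
```

Every loan has the same 5% default probability in both runs; only the dependence changes. With independent loans a 15% default rate is many standard deviations out, while with a shared shock it shows up in a noticeable fraction of periods.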

Back to the ice sheets: if we assume the melting of the ice sheets won't increase (or decrease) the further melting of the ice sheets, we're good. I don't know about causative mechanisms here, but it could be that the models do break down in these times of extreme change.

That doesn't mean that the extreme change is nothing to worry about, since only by being extreme might it break the models.

[1] https://en.m.wikipedia.org/wiki/Central_limit_theorem

Maybe you can make an argument that, in the absence of any information, your best bet is assuming a Gaussian distribution, but it definitely is not safe to assume so. Your data might not be symmetrically or even unimodally distributed, and making these assumptions can lead to completely wrong conclusions.
If you know that your data has a well-defined mean and standard deviation but you know nothing else about it then you start with a Gaussian distribution. This isn't an assumption. The Gaussian distribution has the highest entropy and hence encodes the least information possible about the data. Then as you learn more about your data, you would update this distribution using Bayes' theorem. This could give rise to skew or multiple maxima.
I would consider a well-defined mean and standard deviation an assumption. The distribution of maximum entropy is determined by the constraints. Those constraints have to be assumed. If you constrain your problem to only have nonzero probabilities in a fixed interval, then a uniform distribution will have maximum entropy.
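The entropy claims in this exchange can be checked with textbook closed-form differential entropies. This is a small sketch; the three comparison distributions and the function names are my own choice for illustration.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy (in nats) of a Gaussian with std dev sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def laplace_entropy(sigma):
    """Differential entropy of a Laplace distribution with std dev sigma."""
    b = sigma / math.sqrt(2)  # Laplace scale b has variance 2 * b**2
    return 1 + math.log(2 * b)

def uniform_entropy(sigma):
    """Differential entropy of a uniform distribution with std dev sigma."""
    width = sigma * math.sqrt(12)  # a uniform of width w has variance w**2 / 12
    return math.log(width)

# Constraint 1: fixed variance on the whole real line -> the Gaussian wins.
for name, h in [("gaussian", gaussian_entropy(1.0)),
                ("laplace", laplace_entropy(1.0)),
                ("uniform", uniform_entropy(1.0))]:
    print(f"{name:9s} entropy at unit variance: {h:.4f}")

# Constraint 2: support fixed to [0, 1] -> the uniform wins.
uniform_on_unit = math.log(1.0)           # ln(width) with width 1
triangular_on_unit = 0.5 + math.log(0.5)  # 1/2 + ln((b - a) / 2) on [0, 1]
print(f"uniform on [0, 1]:    {uniform_on_unit:.4f}")
print(f"triangular on [0, 1]: {triangular_on_unit:.4f}")
```

Which distribution is "least informative" depends entirely on the constraint you impose first, which is the point both comments are circling.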
I agree that there is an assumption that the data are well described by a distribution on R instead of some interval [a,b]. I don't know how well justified this is. The assumption of well defined mean and stddev is weak and better supported each time you collect more data. If your stddev is ill defined then you'll find your sample stddev will diverge (increase) as you add more data points.
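The diverging sample stddev is easy to see in a simulation. This is a sketch under my own assumptions: a standard Cauchy distribution stands in for "ill-defined stddev", and `running_stdev` is a hypothetical helper, not anything from the article.

```python
import math
import random
import statistics

def running_stdev(draw, sizes, seed=2):
    """Sample standard deviation of the first n draws, for each n in sizes."""
    rng = random.Random(seed)
    data, result = [], {}
    for n in sorted(sizes):
        while len(data) < n:
            data.append(draw(rng))
        result[n] = statistics.stdev(data)
    return result

def gaussian(rng):
    return rng.gauss(0.0, 1.0)

def cauchy(rng):
    # Inverse-CDF sampling of a standard Cauchy (no finite mean or variance).
    return math.tan(math.pi * (rng.random() - 0.5))

sizes = (100, 1_000, 10_000, 100_000)
gauss_sd = running_stdev(gaussian, sizes)
cauchy_sd = running_stdev(cauchy, sizes)

for n in sizes:
    print(f"n={n:>7}: gaussian {gauss_sd[n]:8.3f}   cauchy {cauchy_sd[n]:12.3f}")
```

The Gaussian column settles near 1, while the Cauchy column keeps jumping upward whenever a new extreme value arrives. The growth is erratic rather than smooth, so a short record can still look deceptively well behaved.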
In this case, as each day is highly correlated to the previous one, it is safe to say that the distribution of daily sea ice variation is probably not Gaussian? (Though obviously, this is very bad...)
The comparison is between ice extent measurements on the same date in different years, so the relevant metric here is not the day-to-day correlation but the year-to-year correlation, which should be very low.
There's still something funny about quoting tail probabilities converted to once in X years, though, isn't there? Maybe I'm thinking about this wrong...
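One way to see what feels funny: the "once in X years" figure is just 1 / (tail probability), and the tail probability depends heavily on the assumed distribution. The sketch below assumes one independent observation per year and compares the Gaussian tail with Cantelli's inequality, which is the worst case when only a mean and a standard deviation are known. The function names are mine.

```python
import math

def gaussian_return_period(n_sigma):
    """Years between events at least n_sigma above the mean, if Gaussian."""
    p = 0.5 * math.erfc(n_sigma / math.sqrt(2))  # one-sided tail probability
    return 1.0 / p

def cantelli_return_period(n_sigma):
    """Shortest return period consistent with just a mean and a std dev.

    Cantelli's inequality: P(X - mu >= k * sigma) <= 1 / (1 + k**2),
    so the return period is at least 1 + k**2 years.
    """
    return 1.0 + n_sigma ** 2

for k in (2, 3, 4, 5, 6):
    print(f"{k}-sigma: Gaussian says once in {gaussian_return_period(k):,.0f} "
          f"years; worst case is once in {cantelli_return_period(k):,.0f} years")
```

The gap between the two columns grows enormously with each extra sigma, so a large n-sigma figure is better read as "the Gaussian model no longer fits" than as a literal waiting time.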