by npk 6226 days ago
The statement "only has meaning if the distribution is presumed to be normal" is wrong. The SD is a summary of the spread of a distribution. In fact, for most centrally concentrated distributions (including a uniform one), +/- 1 sigma covers roughly 60 to 70% of the mass of the distribution (about 58% for a uniform, about 68% for a normal). This is an amazingly useful thing to know.
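To make the +/- 1 sigma figure concrete, here is a minimal sketch that evaluates the closed-form mass within one standard deviation of the mean for three textbook distributions. The choice of distributions (uniform, normal, Laplace) and the function name `mass_within_one_sigma` are my own illustration, not something from the original comment.

```python
import math

def mass_within_one_sigma():
    """Return P(|X - mean| < 1 sigma) for three textbook distributions."""
    return {
        # Uniform on [0, 1]: sigma = 1/sqrt(12), so the interval has width 2*sigma.
        "uniform": 2 / math.sqrt(12),
        # Normal: P(|Z| < 1) = erf(1/sqrt(2)).
        "normal": math.erf(1 / math.sqrt(2)),
        # Laplace with scale b: sigma = sqrt(2)*b and P(|X| < x) = 1 - exp(-x/b).
        "laplace": 1 - math.exp(-math.sqrt(2)),
    }

for name, frac in mass_within_one_sigma().items():
    print(f"{name:8s} {frac:.3f}")
```

This gives roughly 0.577 for the uniform, 0.683 for the normal, and 0.757 for the Laplace, so "about 60%" is a fair rule of thumb, but the exact figure depends on the shape.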

As the above trivia factoid points out, the standard deviation is an important summary statistic. More interestingly, by using mean, variance (or SD), skew, and kurtosis, you can describe almost any centrally concentrated distribution, even distributions with heavy tails.
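As a rough sketch of the four summary statistics mentioned above, the following computes them from a sample using only the standard library. It assumes population-style moments (dividing by n rather than n - 1) and reports excess kurtosis (0 for a normal); the function name `summary_moments` is hypothetical.

```python
import math

def summary_moments(xs):
    """Mean, variance, SD, skew, and excess kurtosis of a sample.

    Uses population moments (divide by n). Assumes the sample has
    nonzero variance; a constant sample would divide by zero.
    """
    n = len(xs)
    mean = sum(xs) / n
    # Central moments of order 2, 3, and 4.
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    sd = math.sqrt(m2)
    return {
        "mean": mean,
        "variance": m2,
        "sd": sd,
        "skew": m3 / sd ** 3,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
    }

print(summary_moments([1, 2, 3, 4, 5]))
```

For the symmetric sample `[1, 2, 3, 4, 5]` the skew is 0 and the excess kurtosis is negative, as you would expect from something flatter than a normal.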

I think what the OP meant is that most 3+ sigma results are not truly 3+ sigma, because most distributions in this world are not Gaussian but instead have large wings. SD is most useful when you know what the underlying distribution is. Currently it's more in fashion to communicate spread using confidence intervals because they presume less about the underlying distribution.
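To illustrate the point about wings, here is a small sketch comparing how often a "k sigma" deviation occurs under a Gaussian versus a heavier-tailed Laplace distribution with the same standard deviation. The Laplace is just my stand-in for "a distribution with large wings"; real data may behave differently, and the function name `two_sided_tail` is hypothetical.

```python
import math

def two_sided_tail(k):
    """Return (gaussian, laplace) values of P(|X - mean| > k * sigma)."""
    # Gaussian: P(|Z| > k) = erfc(k / sqrt(2)).
    gaussian = math.erfc(k / math.sqrt(2))
    # Laplace with scale b has sigma = sqrt(2)*b, so P(|X| > k*sigma) = exp(-k*sqrt(2)).
    laplace = math.exp(-k * math.sqrt(2))
    return gaussian, laplace

g, l = two_sided_tail(3)
print(f"gaussian: {g:.4f}  laplace: {l:.4f}  ratio: {l / g:.1f}")
```

At 3 sigma the Gaussian tail is about 0.27% while the Laplace tail is about 1.4%, so an event quoted as "3 sigma" under a Gaussian assumption would be roughly five times more common if the data were actually Laplace.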

1 comment

You're right. I was being sloppy.

I should have said something more like "the standard deviation calculated from a sample set is only generally applicable insofar as one is willing to assume that the sample set is representative of the distribution as a whole". The default assumption in traditional statistics (such as quoting p-values) is that the distribution is normal, which in real-world situations is often not the case.

Your restatement is right on, although I'd go further and say that standard deviations (and confidence intervals) are only useful metrics with regard to the particular assumptions one is willing to make about the underlying distribution. Yes, you can calculate these measures, but they won't help you if your assumptions are irreparably flawed.