by amelius 1533 days ago
If a quantity cannot be negative (such as a mass), then standard deviation isn't the best choice.

EDIT: Yes, because the Gaussian distribution extends to +/- infinity; davrosthedalek explains it best, below.

3 comments

A fair die roll can only take the positive values {1,2,3,4,5,6}, yet it has a clearly defined standard deviation: sqrt(105/36). There's no clear reason this isn't the 'best choice'; that's just a matter of application.
The point about applications is mostly valid even if theoretically unsatisfying, but I think the thing about dice rolls is basically spurious.
You can calculate the mean μ and the standard deviation σ of a die roll. You get μ=3.5, σ=sqrt(105/36)~=1.708. The distribution is not very similar to a Gaussian, but sometimes these numbers are useful anyway.
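A quick way to check those two numbers, as a minimal sketch using exact fractions (the variable names are just illustrative):

```python
from fractions import Fraction
from math import sqrt

faces = range(1, 7)

# Each face of a fair die has probability 1/6.
mean = Fraction(sum(faces), 6)
variance = sum((Fraction(f) - mean) ** 2 for f in faces) / 6
std = sqrt(variance)

print(mean, variance, std)
```

The variance comes out as 35/12, which is the same fraction as 105/36.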

It's more interesting if you calculate the distribution of the sum of 100 dice rolls. The parameters are easy to calculate: μ=100*3.5=350 and σ=sqrt(100*105/36)~=17.08. But now the distribution is very similar to a Gaussian with the same μ and σ: https://en.wikipedia.org/wiki/Central_limit_theorem They are not equal, because the sum of 100 dice is bounded between 100 and 600 and the Gaussian is not bounded. For most applications, you can just use the Gaussian instead of the exact distribution.
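To see how close the two are, here is a rough sketch that builds the exact distribution of the sum by repeated convolution and compares it point by point with the Gaussian density. The function names are made up for this example.

```python
from math import exp, pi, sqrt

def dice_sum_pmf(n_dice):
    """Exact pmf of the sum of n fair six-sided dice, as {total: probability}."""
    pmf = {0: 1.0}
    for _ in range(n_dice):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, 0.0) + p / 6
        pmf = nxt
    return pmf

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

n = 100
pmf = dice_sum_pmf(n)
mu = sum(t * p for t, p in pmf.items())
sigma = sqrt(sum((t - mu) ** 2 * p for t, p in pmf.items()))

# Largest pointwise gap between the exact pmf and the Gaussian density.
max_abs_diff = max(abs(p - gaussian_pdf(t, mu, sigma)) for t, p in pmf.items())

print(mu, sigma, min(pmf), max(pmf), max_abs_diff)
```

The support runs from 100 to 600 as stated above, while the Gaussian assigns a tiny but nonzero density outside that range.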

The predicted value is so incredibly far from zero that you can pretend it's a truncated Gaussian and not see any actual difference in the results.
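To put a number on "far from zero", one can ask how much of an untruncated Gaussian's mass falls below zero. This is a sketch only: the mu/sigma ratios below are hypothetical and are not taken from the measurement under discussion.

```python
from math import erfc, sqrt

def gaussian_mass_below_zero(mu, sigma):
    """Probability that a Gaussian N(mu, sigma^2) takes a value below zero."""
    return 0.5 * erfc(mu / (sigma * sqrt(2)))

for ratio in (1, 3, 10):  # hypothetical mu/sigma ratios
    mass = gaussian_mass_below_zero(ratio, 1.0)
    print(f"mu/sigma = {ratio:>2}: mass below zero = {mass:.3e}")
```

The mass that truncation at zero would remove shrinks very quickly as mu/sigma grows.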

Alternate reply: Gaussian approximation to the binomial is perfectly valid in all sorts of cases.

What would be a better choice?
GP is probably referring to the coefficient of variation, sigma/mu (standard deviation divided by mean), which normalises out for example the unit of measurement.

However, the 7 here is basically (x - mu)/sigma, so it is normalised (in that sense), anyway.

No, I think the problem (in principle) is that "standard deviation" has a special meaning for Gaussian distributions, which extend to infinity in both directions. A quantity that has a fixed range most likely has an asymmetric distribution, so one would expect an asymmetric error bar as well. But for a sigma<<the value, it's often not a big concern.

A good example is efficiency measurements. I can't count how often I have seen students say something like: Our detector is 99%+-3% efficient. Obviously a detector can't be 102% efficient.

> "standard deviation" has a special meaning for Gaussian distributions,

I have a master's degree in statistics and this is the first I'm hearing about it.

> Our detector is 99%+-3% efficient. Obviously a detector can't be 102% efficient.

In the absence of any other context I'd guess that they're using an approximation to a confidence interval that might be perfectly fine if the estimated value was nearer the center of the allowable range.

Well, special in two senses: First, in the canonical formula for Gaussians, sigma appears directly. Second, for the case at hand, the confidence limits associated with 1 sigma, 2 sigma etc. in physics match exactly the area under the curve for a Gaussian integrated +- said sigma around the mean. That's where that connection actually comes from, and a physicist will always think: Within 1 sigma? That's 68%.
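For reference, the area within +-k sigma of the mean of a Gaussian is erf(k/sqrt(2)). A short sketch (the function name is illustrative):

```python
from math import erf, sqrt

def gaussian_area_within(k):
    """Area under a Gaussian within +-k standard deviations of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"{k} sigma: {gaussian_area_within(k):.4f}")
```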

Hearing 99+-3% is a very strong indication that the person used an incorrect way to determine the uncertainty, most likely by taking the square root of counts. But you are right, if the efficiency were around 50%, that approximation is not so bad.
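A sketch of how a number like 99+-3% can arise, and what some alternatives give. The counts (990 passed out of 1000) are hypothetical, and the Wilson score interval is just one common construction that stays inside [0, 1]; it is not necessarily what anyone in this thread would use.

```python
from math import sqrt

def naive_sqrt_counts(passed, total):
    """Efficiency +- sqrt(passed)/total: treats `passed` as a Poisson count."""
    return passed / total, sqrt(passed) / total

def binomial_std_error(passed, total):
    """Efficiency with the binomial standard error sqrt(eff*(1-eff)/total)."""
    eff = passed / total
    return eff, sqrt(eff * (1 - eff) / total)

def wilson_interval(passed, total, z=1.0):
    """Wilson score interval; z=1.0 corresponds roughly to 1 sigma."""
    eff = passed / total
    denom = 1 + z * z / total
    center = (eff + z * z / (2 * total)) / denom
    half = z * sqrt(eff * (1 - eff) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

passed, total = 990, 1000  # hypothetical counts, for illustration only

eff, naive_err = naive_sqrt_counts(passed, total)
_, binom_err = binomial_std_error(passed, total)
wilson_lo, wilson_hi = wilson_interval(passed, total)

print(f"naive:    {eff:.3f} +- {naive_err:.3f}")
print(f"binomial: {eff:.3f} +- {binom_err:.4f}")
print(f"wilson:   [{wilson_lo:.4f}, {wilson_hi:.4f}]")
```

With these counts the square-root-of-counts error bar reaches above 100%, while the Wilson interval is asymmetric around 99% and stays below 100%.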

What's wrong with saying "Our detector is 99%+-3% efficient," if they are giving the output of some procedure that constructs valid confidence intervals? The confidence intervals will trap the true value 95% of the time (or whatever the confidence level is). If it does what it promises to do, I don't see the problem.
Because 99+3=102 is not a valid upper interval bound. You cannot have >100% efficiency for a detector. Also, your estimate cannot sit at the center of the interval. So maybe 99+1-3 is a valid range (but I would be very suspicious if the bound includes 100%).
I agree 102% is not a possible value for the efficiency of the detector. But if the confidence interval traps the true value of the efficiency 95% of the time upon repeated sampling, what's the problem? That's all that's required for a confidence interval to be valid. Some CI constructions do in general give intervals that include impossible parameter values, but if they contain the true value 95% of the time, there's no issue. The coverage guarantee is all that matters.

(One should not confuse a CI with a range of plausible values, in other words.)
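Whether a given construction actually delivers its promised coverage near the boundary can be checked directly. Below is a rough sketch that computes the exact coverage of two nominal 95% intervals for a binomial efficiency; the sample size (100), the true efficiency (0.99), and the choice of the Wald and Wilson intervals are all assumptions made for this example.

```python
from math import comb, sqrt

def wald_interval(k, n, z=1.96):
    """Symmetric normal-approximation interval; can extend beyond [0, 1]."""
    p_hat = k / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(k, n, z=1.96):
    """Wilson score interval; stays inside [0, 1] up to rounding."""
    p_hat = k / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def exact_coverage(interval, p_true, n):
    """Probability, over k ~ Binomial(n, p_true), that interval(k, n) contains p_true."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n)
        if lo <= p_true <= hi:
            total += comb(n, k) * p_true**k * (1 - p_true) ** (n - k)
    return total

n, p_true = 100, 0.99  # hypothetical sample size and true efficiency
wald_cov = exact_coverage(wald_interval, p_true, n)
wilson_cov = exact_coverage(wilson_interval, p_true, n)

print(f"Wald coverage:   {wald_cov:.3f}")
print(f"Wilson coverage: {wilson_cov:.3f}")
```

Under these assumptions the Wald interval covers the true value only about 63% of the time, largely because it collapses to zero width when every trial passes, while the Wilson interval covers about 92% of the time. So an interval that includes impossible values is not invalid by itself, but it is worth checking that the procedure behind it really has the coverage it claims.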