by gambling8nt 6001 days ago
What Zed is saying when he notes that meta-statistics are normal is that, thanks to the central limit theorem, the average and standard deviation of data sets collected from the same underlying probability distribution (with convergent average and standard deviation) will tend to be normally distributed (in the limit approaching infinite sample size), even if the underlying system behavior is far from a normal distribution. In practice you work with finite sample sizes, so an underlying distribution sufficiently far from normal will result in a non-normal distribution of meta-statistics--but in most applications, these sorts of pathological distributions are largely irrelevant.

Take our example of looking at response time for loading a web page. There is some finite point (say, 10 sec) beyond which we no longer care how much longer it takes. So instead of considering the distribution of response times t, we consider the distribution of min(t, 10 sec). This distribution only has support over a finite interval, so its meta-statistics normalize rapidly as you increase the number of trials.
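To make the capping argument concrete, here is a rough simulation sketch. The response-time model is purely an assumption for illustration (a Pareto with alpha = 1.5 and a 0.2 sec minimum, so its raw variance does not converge); only the 10 sec cutoff comes from the example above. It draws many samples of min(t, 10 sec), takes the mean of each, and reports what fraction of those means falls within one standard deviation of their average. For a normal distribution that fraction would be about 0.68.

```python
import random
import statistics

CAP_SECONDS = 10.0  # the cutoff from the example above


def capped_response_time(rng, alpha=1.5, scale=0.2):
    # Hypothetical heavy-tailed model (an assumption, not measured data):
    # Pareto with minimum `scale` seconds; variance is infinite for alpha <= 2.
    t = scale * rng.paretovariate(alpha)
    return min(t, CAP_SECONDS)


def sample_means(n_per_sample, n_samples, seed=0):
    # One mean per simulated data set of n_per_sample capped response times.
    rng = random.Random(seed)
    return [
        statistics.fmean([capped_response_time(rng) for _ in range(n_per_sample)])
        for _ in range(n_samples)
    ]


def fraction_within_one_sigma(values):
    # Share of values within one standard deviation of their own mean.
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return sum(abs(v - mu) <= sigma for v in values) / len(values)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        means = sample_means(n, 1000)
        print(n, round(fraction_within_one_sigma(means), 3))
```

The one-sigma fraction is a crude normality check, chosen only because it needs nothing beyond the standard library; a proper test or a histogram would say more.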

Using this will under-report the actual standard deviation in the response time (which might, as you say, not even converge), since we've eliminated extremely low probability events with very high response time, but as a practical matter this is largely irrelevant--if these events are high enough probability for us to care we'll notice them anyway. The point of this exercise is not to perfectly ascertain the underlying distribution of t, it is to develop useful predictions for system behavior in practice.

2 comments

The calculated standard deviation from any finite sample size of a long tailed distribution (e.g. Pareto with alpha <= 2) will be off by a factor of infinity. The point is that, not only is the standard deviation irrecoverable in this case, but it's hardly the figure of merit if you do know it.
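A quick way to see this for yourself, as a sketch rather than a proof: draw uncapped Pareto samples with alpha = 1.5 (my choice for illustration; any alpha <= 2 has infinite variance) and look at the typical sample standard deviation as the sample size grows. The median over repeated trials is used here because individual runs jump around a lot.

```python
import math
import random
import statistics


def sample_stdev(xs):
    # Plain sample standard deviation (n - 1 in the denominator).
    n = len(xs)
    mean = math.fsum(xs) / n
    return math.sqrt(math.fsum((x - mean) ** 2 for x in xs) / (n - 1))


def median_stdev(n, alpha=1.5, trials=200, seed=1):
    # Median, over `trials` independent samples of size n, of the sample
    # standard deviation of an uncapped Pareto(alpha) variable.
    rng = random.Random(seed)
    return statistics.median(
        sample_stdev([rng.paretovariate(alpha) for _ in range(n)])
        for _ in range(trials)
    )


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(median_stdev(n), 2))
```

Instead of settling toward a fixed value, the typical estimate should keep drifting upward with n, which is what "irrecoverable" looks like in practice.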
Except that in real life there are no distributions with support outside of a finite interval in space or time; there's always some point when you stop running the system...if some packets don't arrive by that point, you generally don't care how much longer they would have taken.
Except that the sampling distribution of the standard deviation is a scaled chi-square, not a normal. The central limit theorem is only for the mean, not any statistic that you might dream up. It's trivial to think of many that would not converge even with a winsorised response time.
Chi-squared distributions are well approximated by normal distributions close to the mean.
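How good that approximation is depends on the degrees of freedom. As a rough check (a sketch, assuming we compare against a normal with the same mean k and variance 2k), one can draw chi-squared values via the gamma distribution and count how many land within one standard deviation of the mean; a normal would give about 0.68.

```python
import math
import random


def chi_squared_draw(k, rng):
    # A chi-squared variable with k degrees of freedom is a gamma variable
    # with shape k / 2 and scale 2.
    return rng.gammavariate(k / 2.0, 2.0)


def fraction_within_one_sd(k, n_draws, seed=0):
    # Chi-squared(k) has mean k and standard deviation sqrt(2k).
    rng = random.Random(seed)
    sd = math.sqrt(2.0 * k)
    hits = sum(abs(chi_squared_draw(k, rng) - k) <= sd for _ in range(n_draws))
    return hits / n_draws


if __name__ == "__main__":
    for k in (2, 10, 50, 200):
        print(k, round(fraction_within_one_sd(k, 20000), 3))
```

With few degrees of freedom the distribution is visibly skewed and the fraction sits well above 0.68; with more it moves close to the normal value, so both sides of this exchange have a point.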

The point is not that arbitrary statistics will necessarily always be perfectly behaved (or even well behaved) on sampling data--it's that to make reasonably accurate predictions of system behavior, under certain practical conditions, these statistics are well-behaved, and an inexperienced statistician (as most people are) is less likely to make a gross error.

Practical conditions not including routing networks and the stock market, you may wish to add...
Real life situations have finite cutoffs in behavior that remove many pathological problems with certain statistical models.