by abnry 1738 days ago
One way to think about why we tend to use averages instead of medians is that it is related to a really deep theorem in probability: The Central Limit Theorem.

But I think we can twist our heads and see that, in a way, this is backwards. Mathematically, the mean is much easier to work with because it is linear and we can do algebra with it. That's how we got the Central Limit Theorem. Percentiles and the median, except for symmetric distributions, are not as easy to work with. They involve solving for the inverse of the cumulative distribution function.

But in many ways, the median and percentiles are a more relevant and intuitive number to think about. Especially in contexts where linearity is inappropriate!
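To make the linearity point concrete, here is a minimal sketch using only Python's standard library. The sample lists and the helper `empirical_quantile` are made up for illustration; the helper is one simple way to invert an empirical CDF, not the only definition of a quantile.

```python
import math
import statistics

def empirical_quantile(samples, q):
    """Smallest sample x whose empirical CDF value F(x) is at least q."""
    ordered = sorted(samples)                 # inverting the CDF requires a sort
    k = max(math.ceil(q * len(ordered)), 1)   # rank of the requested quantile
    return ordered[k - 1]

# Made-up samples, paired up element by element.
a = [1, 2, 3, 4, 100]
b = [100, 1, 2, 3, 4]
pairwise_sum = [x + y for x, y in zip(a, b)]

# Linearity: the mean of a sum is the sum of the means.
mean_is_linear = statistics.mean(pairwise_sum) == statistics.mean(a) + statistics.mean(b)

# The median obeys no such rule on this data.
median_of_sum = statistics.median(pairwise_sum)
sum_of_medians = statistics.median(a) + statistics.median(b)

print(mean_is_linear, median_of_sum, sum_of_medians)
print(empirical_quantile(a, 0.5))
```

The mean can be computed in one pass with a running sum, while the quantile helper has to sort (or otherwise rank) the data first.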

3 comments

i think of it as: if the data is gaussian, use a mean, otherwise go non-parametric (medians/percentiles).

or put another way, if you can't model it, you're going to have to sort, or estimate a sort, because that's all that's really left to do.

this shows up in things from estimating centers with means/percentiles to doing statistical tests with things like the wilcoxon tests.

Assume up front none of your measured latencies from a software networked system will be Gaussian, or <exaggeration> you will die a painful death </exaggeration>. Even ping times over the internet have no mean. The only good thing about means is you can combine them easily, but since they are probably a mathematical fiction, combining them is even worse. Use T-Digest or one of the other algorithms being highlighted here.
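A rough sketch of the "combine them easily" point, with made-up latency numbers and a hypothetical helper `combine_means`. It only shows why exact medians do not merge from per-shard summaries; it is not an implementation of T-Digest.

```python
import statistics

# Hypothetical latencies in milliseconds from two servers (made-up numbers).
shard_a = [10, 11, 12, 13, 14]
shard_b = [20, 22, 918]

def combine_means(mean_a, n_a, mean_b, n_b):
    """Exact mean of the union, computed from per-shard means and counts."""
    return (mean_a * n_a + mean_b * n_b) / (n_a + n_b)

pooled = shard_a + shard_b

combined_mean = combine_means(statistics.mean(shard_a), len(shard_a),
                              statistics.mean(shard_b), len(shard_b))
pooled_mean = statistics.mean(pooled)   # matches combined_mean

# Medians offer no such shortcut: averaging the per-shard medians
# does not give the median of the pooled data.
avg_of_medians = (statistics.median(shard_a) + statistics.median(shard_b)) / 2
pooled_median = statistics.median(pooled)

print(combined_mean, pooled_mean)
print(avg_of_medians, pooled_median)
```

Note that in this toy data the combined mean is larger than seven of the eight samples, because a single slow request dominates it.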
This is why I try to plot a proper graph of times from any "optimization" I see in a PR. Too many times I see people assuming the timings are Gaussian, and even if they're right they usually forget to take the width of the Gaussian into account (i.e. wow, your speedup is 5% of a standard deviation!)
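As a sketch of that check, with made-up benchmark timings: express the claimed speedup in units of the run-to-run standard deviation before getting excited about it.

```python
import statistics

# Hypothetical benchmark timings in milliseconds (made-up numbers).
before = [100, 120, 80, 110, 90]
after = [99, 119, 79, 109, 89]

speedup = statistics.mean(before) - statistics.mean(after)   # difference of means
spread = statistics.stdev(before)                            # sample standard deviation
speedup_in_sd = speedup / spread                             # speedup in units of spread

print(f"speedup: {speedup} ms, or {speedup_in_sd:.1%} of a standard deviation")
```

A ratio this small suggests the difference may be indistinguishable from noise, which is also what a plot of the raw timings would show.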
yep, have made that mistake before. even turned in a write-up for a measurement project in a graduate-level systems course that reported measurements dependent on network performance as means over trials, with error bars from standard deviations.

sadly, the instructor just gave it an A and moved on. (that said, the amount of work that went into a single semester project was a bit herculean, even if i do say so myself)

> the median and percentiles are a more relevant and intuitive number to think about

A good example of this is when people say "more than half of drivers say they are above average drivers! Idiots!" Of course that's perfectly possible if most drivers are basically fine and some are really really bad. For example, 99.99% of people have greater than the average number of legs.

The truly impossible statement would be that more than half of drivers are better than the median driver.
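The legs example can be checked with a quick sketch. The population below is made up (9,999 people with two legs and one person with one) and is chosen only so that the numbers come out round.

```python
import statistics

# Hypothetical population: 9,999 people with two legs and one person with one.
legs = [2] * 9999 + [1]

mean_legs = statistics.mean(legs)       # pulled slightly below 2 by the one outlier
median_legs = statistics.median(legs)   # unaffected by the outlier

# Share of the population strictly above each measure of center.
share_above_mean = sum(x > mean_legs for x in legs) / len(legs)
share_above_median = sum(x > median_legs for x in legs) / len(legs)

print(mean_legs, median_legs)
print(share_above_mean, share_above_median)
```

Almost everyone is above the mean, but nobody is above the median, and by definition the share strictly above the median can never exceed one half.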

It's more related to the law of large numbers than the CLT