Hacker News new | ask | show | jobs
by enriquto 699 days ago
> the latter is much easier: sample 10,000 elements uniformly at random and take their median

Do you have a source for that claim?

I don't see how could that possibly be true... For example, if your original points are sampled from two gaussians of centers -100 and 100, of small but slightly different variance, then the true median can be anywhere between the two centers, and you may need a humungous number of samples to get anywhere close to it.

True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.

4 comments

> this does not mean that the median can be found accurately by your method.

You can do dynamic sampling: e.g. take double the samples, see what decimal in your result budges. Adjust.

Asymptotic properties of quantile estimators are widely studied [1]. The key is to have a sufficiently large sample size.

[1] Bahadur, R. R. (1966). A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 577–580.

Yet, for any given distribution the sample size can be arbitrarily close to infinite. Unless I've missed something, I don't see the relevance.

If you want the n-9s rate of failure (eg., n=5, 99.999) for a system with a power-law performance distribution, you could be waiting much more than a billion samples to see a failure.

eg., 3E10 ms (30 bn samples) in a year, at 5-9s failure rate has 3E3 ms of failure (3 thousand samples) -- which will be lower given a sampling strategy.

Relevance: from the cited paper, the variance of the median estimator is proportional to 1/(n * f^2), where n is the sample size and f is the density at median.

Two observations: 1. With sufficiently large n, you can control your variance to an acceptable level. 2. The factor f is outside of your control. If your problem has a small density around the median, then you'll need to throw more samples to compensate for it.

I think your concern is about #2: you can always construct a pathological distribution to make the sampling-based approach unattainable. The paper provided guidance on what to expect when you apply the sampling-based approach.

the key word is “uniformly”. If your data is not uniformly distributed, then you just have to pick random values non-uniformly. There are many ways to do that, and once you have your algo you’ll be able to reliably find an approximation of the median much faster than you would find the actual median.

https://en.m.wikipedia.org/wiki/Non-uniform_random_variate_g...

I wasn't saying that you could get within 1% of the true median, I was saying you could find an element in the 49th to 51st percentile. In your example, the 49th percentile would be -90 and the 51st percentile would be 90, so the guarantee is that you would (with very high probability) find a number between -90 and 90.

That's a good point, though, that the 49th percentile and 51st percentile can be arbitrarily far from the median.