| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by justinpombrio 699 days ago
	Did you actually need to find the true median of billions of values? Or would finding a value between 49.9% and 50.1% suffice? Because the latter is much easier: sample 10,000 elements uniformly at random and take their median. (I made the number 10,000 up, but you could do some statistics to figure out how many samples would be needed for a given level of confidence, and I don't think it would be prohibitively large.)

4 comments

rented_mule 699 days ago

The kind of margin you indicate would have been plenty for our use cases. But, we were already processing all these log entries for multiple other purposes in a single pass (not one pass per thing computed). With this single pass approach, the median calculation could happen with the same single-pass parsing of the logs (they were JSON and that parsing was most of our cost), roughly for free.

Uniform sampling also wasn't obviously simple, at least to me. There were thousands of log files involved, coming from hundreds of computers. Any single log file only had timings from a single computer. What kind of bias would be introduced by different approaches to distributing those log files to a cluster for the median calculation? Once the solution outlined in the previous comment was identified, that seemed simpler that trying to understand if we were talking about 49-51% or 40-50%. And if it was too big a margin, restructuring our infra to allow different log file distribution algorithms would have been far more complicated.

jiggawatts 698 days ago

Speaking of "single pass", one of the criticisms I have of the "enumerator" patterns in modern programming languages is that they encourage multiple passes.

As an example: computing the .min() and .max() of an enumerable is two passes even though it could be done with one pass.

I'd love to see a language embrace a more efficient style similar to how a SQL does it, where you can elegantly request this as a single pass over the data: "SELECT min(x), max(x) FROM y"

_dain_ 698 days ago

What's wrong with doing it in two passes? N iterations each doing 2 operations is exactly the same cost as 2*N iterations each doing 1 operation. Because multiplication is commutative.

buildbot 698 days ago

This is true, but a big advantage of the single pass method is data reuse. Instead of loading each element twice, you load each element once. Per bit, reading/writing to external memory is massively more energy intensive than any compute op. Orders of magnitude more.

srcreigh 698 days ago

It’s more like 2NM where M is loading the data from disk/memory. One pass is 2N+M.

Why go to the store and back twice to buy two things instead of buying two things in one trip? ;p

jiggawatts 697 days ago

I think you meant 2N+2M vs 2N+M, but yes, that’s the point: not reading twice is cheaper because unlike in traditional big-O analysis compute is cheap but data access is very expensive.

srcreigh 696 days ago

Yep, thx.

davidmurdoch 698 days ago

Does C# have the plumbing for this built in? It's been 7 years since using it so I might not be remembering correctly.

neonsunset 698 days ago

IEnumerable<T>.Max and .Min are the same as they were just significantly faster through the use of SIMD: https://github.com/dotnet/runtime/blob/ebbebaca1184940f06df6...

You could implement a similar (and simpler) fast path for types with contiguous memory by performing min and max per iteration instead.

davidmurdoch 697 days ago

I was thinking LINQ was optimized to do a single enumeration for cases such as this. Can LINQ be easily, naturally, and idiomatically used with `IEnumerable<T>.Max` and `.Min` to avoid looping twice, or is it more of a more rigorous and intentional optimization the engineer would have to seek out?

cerved 698 days ago

what's preventing you from doing it in one pass?

sroussey 698 days ago

Actually, seeking the bias numbers can be quite illuminating.

rented_mule 698 days ago

You're absolutely right! Some learning that we came to later that isn't unrelated to what you're saying... don't just look at metrics (in the case I've described above, it was timings of operations in a large system), but look at histograms for them. You should be able to explain why they have the shape they do. The distributions are so often multimodal, and understanding the modes helps you understand a lot more nuance about your system.

We were an infra-focused team building up a scalable data handling / computation platform in preparation to apply ML at a non-trivial scale. When we finally hired some people with deep stats knowledge 10 or 12 years ago, there was a lot of great learning for everyone about how each of our areas of expertise helped the others. I wish we'd found the right stats people earlier to help us understand things like this more deeply, more quickly.

sroussey 697 days ago

Even some regular engineers with experience dealing with many servers will have built up an intuition to exploit.

In fact, this is often where decisions on what metrics to have in the first place come from. Ask why. You can go far without deep stats knowledge!

enriquto 699 days ago

> the latter is much easier: sample 10,000 elements uniformly at random and take their median

Do you have a source for that claim?

I don't see how could that possibly be true... For example, if your original points are sampled from two gaussians of centers -100 and 100, of small but slightly different variance, then the true median can be anywhere between the two centers, and you may need a humungous number of samples to get anywhere close to it.

True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.

B1FF_PSUVM 698 days ago

> this does not mean that the median can be found accurately by your method.

You can do dynamic sampling: e.g. take double the samples, see what decimal in your result budges. Adjust.

rhymer 698 days ago

Asymptotic properties of quantile estimators are widely studied [1]. The key is to have a sufficiently large sample size.

[1] Bahadur, R. R. (1966). A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 577–580.

mjburgess 698 days ago

Yet, for any given distribution the sample size can be arbitrarily close to infinite. Unless I've missed something, I don't see the relevance.

If you want the n-9s rate of failure (eg., n=5, 99.999) for a system with a power-law performance distribution, you could be waiting much more than a billion samples to see a failure.

eg., 3E10 ms (30 bn samples) in a year, at 5-9s failure rate has 3E3 ms of failure (3 thousand samples) -- which will be lower given a sampling strategy.

rhymer 697 days ago

Relevance: from the cited paper, the variance of the median estimator is proportional to 1/(n * f^2), where n is the sample size and f is the density at median.

Two observations: 1. With sufficiently large n, you can control your variance to an acceptable level. 2. The factor f is outside of your control. If your problem has a small density around the median, then you'll need to throw more samples to compensate for it.

I think your concern is about #2: you can always construct a pathological distribution to make the sampling-based approach unattainable. The paper provided guidance on what to expect when you apply the sampling-based approach.

maronato 699 days ago

the key word is “uniformly”. If your data is not uniformly distributed, then you just have to pick random values non-uniformly. There are many ways to do that, and once you have your algo you’ll be able to reliably find an approximation of the median much faster than you would find the actual median.

https://en.m.wikipedia.org/wiki/Non-uniform_random_variate_g...

justinpombrio 698 days ago

I wasn't saying that you could get within 1% of the true median, I was saying you could find an element in the 49th to 51st percentile. In your example, the 49th percentile would be -90 and the 51st percentile would be 90, so the guarantee is that you would (with very high probability) find a number between -90 and 90.

That's a good point, though, that the 49th percentile and 51st percentile can be arbitrarily far from the median.

andruby 699 days ago

I was thinking the same thing.

In all use-cases I've seen a close estimate of the median was enough.

hhmc 696 days ago

You also can use the fact that for any distribution, the median is never further than 1SD away from the mean.