Hacker News new | ask | show | jobs
by mehrdadn 2036 days ago
This is a horrible algorithm... just imagine trying to find the median of {100, 101, 102, 103, 104}... your estimate would be 5 which is ridiculous. At the very least, you probably don't want your estimate to be in {-1, 0, +1} after seeing one element -- you want it to be that element instead. The conditions required for this to converge are incredibly strict - it's cool from a theoretical standpoint regarding the memory usage, but I wouldn't use it as anything in practice.
7 comments

If you're trying to reduce the memory usage of calculating the median, you're not motivated by 5 element streams.

Further, I don't think "number of items > k*median" is a particularly grueling criterion for this algorithm (where k is some constant based on the delta).

Here is the first paragraph of the Introduction, for your reference:

> Modern applications require processing streams of data for estimating statistical quantities such as quantiles with small amount of memory. A typical application is in IP packet analysis systems such as Gigascope [8] where an example of a query is to find the median packet (or flow) size for IP streams from some given IP address. Since IP addresses send millions of packets in reasonable time windows, it is prohibitive to store all packet or flow sizes and estimate the median size. Another application is in social networking sites such as Facebook or Twitter where there are rapid updates from users, and one is interested in median time between successive updates from a user. In yet another example, search engines can model their search traffic and for each search term, want to estimate the median time between successive instances of that search.

You can also read the paper to see how they implement Frugal-2U, which has better convergence characteristics for twice the memory.

They even address your specific complaint: "Note that Frugal-1U and Frugal-2U algorithms are initialized by 0, but in practice they can be initialized by the first stream item to reduce the time needed to converge to true quantiles."

You need the cardinality of the data stream to substantially exceed the value of the median; it's also the case that if the mean is very high compared to the range (or variance), you'd do better setting your initial guess to the first item.

I think, as you suggest, it's more fun to say "given my distribution x, what's a good statistic for the median and what are its properties?" than "what's a good general purpose technique for finding a median of any distribution subject to computational constraint y"

You also need the range of the dataset to be much larger than 1 (or whatever step is used)
I don't know about this particular algorithm, but I believe some similar approaches only converge if you're estimating on values from some stable or slowly changing distribution. A monotonically increasing series would not be a good fit (nor a small dataset).
Yeah, those aren't relevant to what I was saying here. Pick any permutation of [2N, 2N + 1, ..., 3N - 1] for N as large as you want and you'll see this algorithm estimate the median as N.
Fair points, and I agree on the need for a better initial estimate heuristic :-)
Maybe worth noting that while setting the initial estimate to the first element will help in a lot of cases, it will still fail catastrophically when you happen to get a small element in the beginning, which isn't all that unlikely (say if 1% of your data happens to be zero and the rest similar to before). I haven't researched it but I would imagine actually fixing the algorithm so that it succeeds with high probability in practical (but non-adversarial) cases may well be nontrivial.
This algorithm basically works when the number of elements to compute the median for is at least as large as the true median, and you've a fairly normal distribution of element values.
I think the intent is to calculate medians on longer running streams of data than that. Why would you even use this algorithm for 5 values?
> I think the intent is to calculate medians on longer running streams of data than that. Why would you even use this algorithm for 5 values?

I don't know if you realize but these kinds of comments are incredibly frustrating to respond to. It's like you didn't even try to understand the comment before replying. I was not saying "it's terrible because it easily fails on 5 elements". I was saying "it's terrible because it easily fails, and to illustrate, here's an example with 5 elements that will help you understand the problem I'm talking about". Does that make sense? Surely it's not rocket science to see that the number 5 wasn't special here? Surely you can see how, say, if your list starts at 2E9 and goes up to 3E9, the exact same problem would occur for the billion-element list, and hence that my gripe was probably not about 5-element lists?

It's an online algorithm. It's meant to be used on essentially infinite streams of data, such as a live dashboard of latencies of a running server. In that context, providing a 5 element, or any finite list as an example seems like a non sequitur.

Your examples do relate to problems the algorithm actually has in this application, but they manifest as things like "extremely large warmup time" and "adjusting to a regime change in the distribution taking time proportional to the change". For instance if your data is [1000, 1001, 1000, 1001...] then it takes 1000 steps to converge, which may be longer than the user has patience for.

However, the algorithm does always converge eventually, as long as the stream is not something pathological like an infinite sequence of consecutive numbers (for which the median is undefined anyhow).

If you have a monotonically increasing stream of numbers no algorithm is going to converge on a meaningful average, even if it's "correct" in a technical sense. This algorithm is intended to give a (very) cheap estimate for values with some kind of organic distribution (probably normal).

It doesn't have to be accurate for all possible edge cases because that's not the point of cheap estimation like this.

What I said has absolutely nothing to do with monotonicity. Pick any permutation of [2N, 2N + 1, ..., 3N - 1] for N as large as you want and you'll see the algorithm estimate the median as N, which is below even the minimum.
> [2N, 2N + 1, ..., 3N - 1]

This is practically the definition of monotonicity. (https://en.wikipedia.org/wiki/Monotonic_function)

You should read the paper, it doesn't have the problems you think it has when used on real world data.

The parent mentioned permutation of those numbers. Does that affect your response?

Also note that the parent wasn't commenting on the algorithm in this HN submission, but rather the algorithm described in the top-level comment.

Quote from the paper cited in the post describing this algorithm:

> These algorithms do not perform well with adversarial streams, but we have mathematically analyzed the 1 unit memory algorithm and shown fast approach and stability properties for stochastic streams

Your criticism is really pretty strident for saying something the authors probably completely understand and agree with.

I think it's assuming that the dataset is i.i.d. distributed and sufficiently large.
Adjusting the algorithm to fit a usecase is trivial though