|
|
|
|
|
by mehrdadn
2036 days ago
|
|
This is a horrible algorithm... just imagine trying to find the median of {100, 101, 102, 103, 104}... your estimate would be 5 which is ridiculous. At the very least, you probably don't want your estimate to be in {-1, 0, +1} after seeing one element -- you want it to be that element instead. The conditions required for this to converge are incredibly strict - it's cool from a theoretical standpoint regarding the memory usage, but I wouldn't use it as anything in practice. |
|
Further, I don't think "number of items > k*median" is a particularly grueling criterion for this algorithm (where k is some constant based on the delta).
Here is the first paragraph of the Introduction, for your reference:
> Modern applications require processing streams of data for estimating statistical quantities such as quantiles with small amount of memory. A typical application is in IP packet analysis systems such as Gigascope [8] where an example of a query is to find the median packet (or flow) size for IP streams from some given IP address. Since IP addresses send millions of packets in reasonable time windows, it is prohibitive to store all packet or flow sizes and estimate the median size. Another application is in social networking sites such as Facebook or Twitter where there are rapid updates from users, and one is interested in median time between successive updates from a user. In yet another example, search engines can model their search traffic and for each search term, want to estimate the median time between successive instances of that search.
You can also read the paper to see how they implement Frugal-2U, which has better convergence characteristics for twice the memory.
They even address your specific complaint: "Note that Frugal-1U and Frugal-2U algorithms are initialized by 0, but in practice they can be initialized by the first stream item to reduce the time needed to converge to true quantiles."