Hacker News new | ask | show | jobs
by cosmic_quanta 701 days ago
That is cool if you can tolerate approximations. But the uncomfortable questions soon arise: Can I tolerate an approximate calculation? What assumptions about my data do I to determine an error bound? How to verify the validity of my assumptions on an ongoing basis?

Personally I would gravitate towards the quickselect algorithm described in the OP until I was forced to consider a streaming median approximation method.

3 comments

A use case for something like this, and not just for medians, is where you have a querying/investigation UI like Grafana or Datadog or whatever.

You write the query and the UI knows you're querying metric xyz_inflight_requests, it runs a preflight check to get the cardinality of that metric, and gives you a prompt: "xyz_inflight_requests is a high-cardinality metric, this query may take some time - consider using estimated_median instead of median".

Well, I believe you could use the streaming algorithms to pick the likely median, so help choose the pivot for the real quickselect. quickselect can be done inplace too which is O(1) memory if you can afford to rearrange the data.
Then you don't get a guaranteed O(n) complexity if the approximated algorithm happen to make bad choices
say you want the median value since the beginning of the day for a time series that has 1000 entries per second, that's potentially hundreds of gigabytes in RAM, or just a few variables if you use a p-square estimator.
any approximation gives a bound on the error.