| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cosmic_quanta 701 days ago
	That is cool if you can tolerate approximations. But the uncomfortable questions soon arise: Can I tolerate an approximate calculation? What assumptions about my data do I to determine an error bound? How to verify the validity of my assumptions on an ongoing basis? Personally I would gravitate towards the quickselect algorithm described in the OP until I was forced to consider a streaming median approximation method.

3 comments

skrtskrt 701 days ago

A use case for something like this, and not just for medians, is where you have a querying/investigation UI like Grafana or Datadog or whatever.

You write the query and the UI knows you're querying metric xyz_inflight_requests, it runs a preflight check to get the cardinality of that metric, and gives you a prompt: "xyz_inflight_requests is a high-cardinality metric, this query may take some time - consider using estimated_median instead of median".

link

conradludgate 701 days ago

Well, I believe you could use the streaming algorithms to pick the likely median, so help choose the pivot for the real quickselect. quickselect can be done inplace too which is O(1) memory if you can afford to rearrange the data.

link

SkiFire13 701 days ago

Then you don't get a guaranteed O(n) complexity if the approximated algorithm happen to make bad choices

link

mgaunard 701 days ago

say you want the median value since the beginning of the day for a time series that has 1000 entries per second, that's potentially hundreds of gigabytes in RAM, or just a few variables if you use a p-square estimator.

link

mgaunard 701 days ago

any approximation gives a bound on the error.

link