Hacker News new | ask | show | jobs
by mherdeg 1738 days ago
I've skimmed some of the literature here when I've spent time trying to help people with their bucket boundaries for Prometheus-style instrumentation of things denominated in "seconds", such as processing time and freshness.

My use case is a little different from what's described here or in a lot of the literature. Some of the differences:

(1) You have to pre-decide on bucket values, often hardcoded or stored in code-like places, and realistically won't bother to update them often unless the data look unusably noisy.

(2) Your maximum number of buckets is pretty small -- like, no more than 10 or 15 histogram buckets probably. This is because my metrics are very high cardinality (my times get recorded alongside other dimensions that may have 5-100 distinct values, things like server instance number, method name, client name, or response status).

(3) I think I know what percentiles I care about -- I'm particularly interested in minimizing error for, say, p50, p95, p99, p999 values and don't care too much about others.

(4) I think I know what values I care about knowing precisely! Sometimes people call my metrics "SLIs" and sometimes they even set an "SLO" which says, say, I want no more than 0.1% of interactions to take more than 500ms. (Yes, those people say, we have accepted that this means that 0.1% of people may have an unbounded bad experience.) So, okay, fine, let's force a bucket boundary at 500ms and then we'll always be measuring that SLO with no error.

(5) I know that the test data I use as input don't always reflect how the system will behave over time. For example I might feed my bucket-designing algorithm yesterday's freshness data and that might have been a day when our async data processing pipeline was never more than 10 minutes backlogged. But in fact in the real world every few months we get a >8 hour backlog and it turns out we'd like to be able to accurately measure the p99 age of processed messages even if they are very old... So despite our very limited bucket budget we probably do want some buckets at 1, 2, 4, 8, 16 hours, even if at design time they seem useless.

I have always ended up hand-writing my own error approximation function which takes as input like

(1) sample data - a representative subset of the actual times observed in my system yesterday

(2) proposed buckets - a bundle of, say, 15 bucket boundaries

(3) percentiles I care about

then returns as output info about how far off (%age error) each estimated percentile is from the actual value for my sample data.

Last time I looked at this I tried using libraries that purport to compute very good bucket boundaries but they give me, like, 1500 buckets with very nice tiny error, but no clear way to make real-world choice about collapsing this into a much smaller set of buckets with comparatively huge, but manageable, error.

I ended up just advising people to

* set bucket boundaries at SLO boundaries, and be sure to update when the SLO does

* actually look at your data and understand the data's shape

* minimize error for the data set you have now; logarithmic bucket sizes with extra buckets near the distribution's current median value seems to work well

* minimize worst-case error if the things you're measuring grow very small or very large and you care about being able to observe that (add extra buckets)

3 comments

One thing you could try to use are variance optimal histograms. These are histograms which set bucket boundaries such that the weighted average variance in the buckets is minimized. Unfortunately, this algorithm is quadratic with the number of data points, but you could try approximating the optimum by taking random subsets of the data, computing the histogram, then seeing how well the histogram does on the whole dataset.

http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/pap...

NB: Post author here.

This is interesting and I totally get at least some of the problems you're facing. I wonder if you could take some of the strategies from t-digest and modify a bit to accomplish...I'd be interested in seeing some sort of implementation of this and would love to see if we can get it into our toolkit if you do...or you can also open up a ticket for us and we'll see if we can prioritize to work on something like it ourselves.

I do think there are some interesting ways you can cut corners if you know things about the "SLO" or other sort of cutoff values, but I'd have to think more deeply about it to say more. Essentially we want a variable error rate based on the distance away from a known value, where you have little error in the values relatively close to the known value, care little if at all for fine distinctions on the low end and, once you get past some high end value you could probably bucket everything above it into a "large outliers" bucket or something too...meaning you p999 could get out of hand if you start getting lots of stuff above a threshold, but, if that starts happening, probably everything's already burning down, so might not be that useful to know it's burning down in a very specific way...

Are you following the work on sparse high resolution histograms in both Prometheus and OpenTelemetry?

It's coming together and will be available soon.

What this means is that you won't have to compromise and pick histogram bucket boundaries anymore. And each bucket will be much narrower.

look at Greenwald-Khanna (and the followon work). It adapts the bucket size to minimize the total error (and the number of buckets is proportional to log-epsilon I think).

with all the competition and 'innovation', you would think the data dogs of this world would grow up beyond time series.