Hacker News new | ask | show | jobs
by homin 2492 days ago
Author here. We wanted to be able to graph p99, p99.9 metrics with arbitrary ranges, and found the existing solutions were not accurate enough for our needs. Happy to answer any questions.

Code here:

https://github.com/DataDog/sketches-go

https://github.com/DataDog/sketches-py

https://github.com/DataDog/sketches-java

7 comments

Nice work! Averaging percentiles is well-known to give terrible results. Glad to see more people, taking this problem serious, and providing viable alternatives!

A note on Accuracy: At Circonus, we have been using a version of HDR-Histograms [1] for many years to aggregate latency distributions, and calculate accurate aggregated percentiles. Accuracy was never a problem (worst-case error <5%, usually _much_ better).

If I read your evaluation results correctly, you also found HRD-Histograms to be as-accurate or more-accurate, than DDSketches, correct?

The differentiator to HDR Histograms seems to be merging speed and size, where DDSketches seem to have an edge.

One thing that is not immediately clear to me from reading the paper is, how much of the distribution function can be reconstructed from the sketch? E.g. for SLO calculations one is often interested in latency bands: "How many requests were faster than 100ms?" [2].

Is it possible to approximate CDF values ("lower counts") from the sketch with low/bounded error?

[1] https://github.com/circonus-labs/libcircllhist

[2] http://heinrichhartmann.com/pdf/Heinrich%20Hartmann%20-%20La...

HDR is great for the use-case when you can bound your range beforehand and merging is not a requirement, but those were also the reasons we needed to develop DDSketch.
Sorry, but I don't follow this argument:

(1) HDR-Histogram merges are 100% accurate and very fast (few microseconds)

(2) The range of HDR-Histogram is bounded in the same way that floating point numbers are bounded. Hence the name "High Definition Range":

> For example, a Histogram could be configured to track the counts of observed integer values between 0 and 3,600,000,000,000 while maintaining a value precision of 3 significant digits across that range.

http://hdrhistogram.github.io/HdrHistogram/

Circllhist offers a default range of 10^-128 .. 10^+127, which has been more than ample for all use-cases I have seen.

https://github.com/circonus-labs/libcircllhist/blob/master/s...

Yes, the sketch can quickly answer questions like "How many requests were faster than 100ms?" 100ms maps to a bucket index, and we would just sum up all the buckets with indices at most that.
FYI, already in 2015, I have proposed exactly the same idea as improvement to HdrHistogram. See https://github.com/HdrHistogram/HdrHistogram/issues/54 and corresponding code https://github.com/oertl/HdrHistogram/blob/memory_efficiency.... Unfortuantely, the author of HdrHistogram did not pick up my proposal as that would have lead to major changes and problems regarding compatibility. The mapping to histogram bins is the same as in your code https://github.com/DataDog/sketches-java/blob/master/src/mai.... I am curious, have you been inspired by that?
This sucks, but a problem we have in academia is that you can't reference a closed PR request at a top-tier conference like VLDB. If you wrote up your proposal as a referable document, you should rightly claim that you deserve scholarly recognition. Morally, you deserve recognition (which you are getting here, at least). It's decent of the author to admit he was inspired by you - there are others who would think it is easier to deny it. But there is no recourse, if it's only a PR.
I did see that thread and that's what inspired the mapping with the quadratic interpolation.

To clarify though, DDSketch as defined in the paper uses the logarithmic mapping, as our main goal was to make the memory footprint as small as possible. See: https://github.com/DataDog/sketches-java/blob/1650d939f1485f...

The Java implementation abstracts out the index mapping. This let us add alternative ways to map values to indices and we added the method with the quadratic interpolation as it seems to be a good tradeoff between index computation speed and memory footprint.

Thanks for publishing this. I admit I have only skimmed the paper and plan to look closer later today. My first critical thought was "I wonder why section 4 doesn't make comparisons to t-digest?" I think of t-digest as the most common mergeable streaming modern quantile algorithm in practice. Why didn't you include it in your comparisons?

Either way - looks very promising, I am excited to take a closer look and possibly use this. Thanks!

T-digest is definitely the best rank-accuracy sketch for higher percentiles in terms of size. However, it was much slower than GK, and we found that by increasing the rank-accuracy of GK (taking the penalty of a larger sketch-size), the results were not too different. Both of course had the issues inherent in all rank-accurate sketches in the higher percentiles over long-tailed data.
No background in this, my layman interpretation is: Track a histogram with log-scaled bins. These are easy to increment and only takes up size of the number of bins in terms of memory. Two histograms from different servers can easily be merged together since they use the same bins.

But then I got completely lost in the math to show why this guarantees anything. What's actually preventing the scenario where everything falls into the same or just a few bins? (as so happens in the real world where all your data is of a relatively similar order of magnitude) For example what would happen if my data is random from 1 to 10? And on a different server random 1000 to 1010?

> number of bins in terms of memory

Less than that, with the appropriate compression scheme (sparse encoding).

> But then I got completely lost in the math to show why this guarantees anything.

You get guarantees on _relative_ error:

- 100 to 110 is a 10% error

- 1000 to 1010 is a 1% error

Depending on your logarithmic base, you can detect errors up to a certain relative %-tage.

That makes a lot sense, thank you! I was stuck on the visualization analogy where choosing the wrong axes for the histogram makes it useless (e.g., when all the data falls into the same bin you can't see the shape of the distribution). But for answering quantile, this doesn't matter. Even if P1 and P99 are in the same bin you still know what their values are up to a multiplicative constant. Fantastic!
Is it fair to say that if you know the min and max values for your dataset, the DDSketch (fast) is strictly better than HDR Histogram for applications where lack of runtime memory allocations and add performance is critical?
In theory yes, but none of our implementations specifically prevent reallocations as we were not targeting this use-case. It should be easy enough to add though.
Minor nit from paper: The range for Moments is listed as bounded due to overflow risk using floats.

This may be true for the provided implementation, but it's not inherent to the algorithm. The calcuation of the higher order moments would normally done entirely in the log domain where there's little risk of overflow.

I've relied on HDR histograms for that. It looks like your sketches are a bit more performant in the paper, but what was the issue with accuracy that histograms couldn't fulfill?