Hacker News new | ask | show | jobs
by zawerf 2484 days ago
No background in this, my layman interpretation is: Track a histogram with log-scaled bins. These are easy to increment and only takes up size of the number of bins in terms of memory. Two histograms from different servers can easily be merged together since they use the same bins.

But then I got completely lost in the math to show why this guarantees anything. What's actually preventing the scenario where everything falls into the same or just a few bins? (as so happens in the real world where all your data is of a relatively similar order of magnitude) For example what would happen if my data is random from 1 to 10? And on a different server random 1000 to 1010?

1 comments

> number of bins in terms of memory

Less than that, with the appropriate compression scheme (sparse encoding).

> But then I got completely lost in the math to show why this guarantees anything.

You get guarantees on _relative_ error:

- 100 to 110 is a 10% error

- 1000 to 1010 is a 1% error

Depending on your logarithmic base, you can detect errors up to a certain relative %-tage.

That makes a lot sense, thank you! I was stuck on the visualization analogy where choosing the wrong axes for the histogram makes it useless (e.g., when all the data falls into the same bin you can't see the shape of the distribution). But for answering quantile, this doesn't matter. Even if P1 and P99 are in the same bin you still know what their values are up to a multiplicative constant. Fantastic!