
by jgehrcke 1960 days ago

Thanks for bringing this topic to this thread. I'm a physicist by heart and education myself, and I observe that in the software/observability industry we like collecting data much more than we like properly processing and interpreting it.

> Understanding the distribution of your data (rather than just averages) is arguably the most important feature you want from a monitoring dashboard, so the weak support for quantiles is very limiting.

So much yes! It's a relief to see that we have people here in this thread (and industry) who understand this :-).

People with a deep background and experience in experimentation, measurement, and quantification rightly want to see the nature of the data distribution first before they feel in any way OK about proceeding with aggregates.

Parent commenter knows this, but for people reading along: using aggregates (such as mean, standard deviation, standard error, quantiles, ...) implies dropping information. Going from the full distribution to a simplified representation is a lossy transformation of the data. Of course, one wants to be smart about _which_ information to drop. It should be intuitive that one can only be smart about this choice when one knows the underlying distribution. Often, data is neither normally distributed nor Poisson-distributed, but instead distributed in a way that is specific to the use case -- in a way that deserves brief characterization (a quick look is often enough!), which then allows for making informed decisions about which aggregate parameters to look at -- and which pieces of information are fine to drop.
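
To make the "lossy transformation" point concrete, here is a minimal sketch in Python. The two latency samples are made up for illustration (they are not from any real system): both have the same mean, but only a look at the quantiles reveals that one of them has a slow tail.

```python
import statistics

# Hypothetical latency samples in milliseconds (made up for illustration).
steady = [100.0] * 1000                  # every request takes 100 ms
bimodal = [50.0] * 900 + [550.0] * 100   # 90% fast, 10% very slow


def summarize(samples):
    """Return (mean, p50, p99) for a list of samples."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return statistics.mean(samples), cuts[49], cuts[98]


for name, samples in [("steady", steady), ("bimodal", bimodal)]:
    mean, p50, p99 = summarize(samples)
    print(f"{name:8s} mean={mean:.1f} p50={p50:.1f} p99={p99:.1f}")
```

Both samples report a mean of 100 ms, so a dashboard that only shows the mean cannot tell them apart. Which aggregate is the right one to keep depends on knowing that the second sample is bimodal in the first place.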

> Histograms require manually specifying the distribution of your data, which is time-consuming, lossy, and can introduce significant error bands around your quantile estimates.

Yes! Great point. Honestly, I was a little bit shocked when I saw how this works in the Prometheus ecosystem. I think I happen to have an example of this: we (Opstrace) contributed a tiny patch to Cortex in which we changed the parameterization of a specific histogram metric, because the upper band was super broad, leading to a blind spot (a lack of resolution) in the range of values that was most interesting to us -- see https://github.com/cortexproject/cortex/issues/2530 and https://github.com/cortexproject/cortex/pull/2540.
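
For people reading along, a rough sketch of why a broad bucket creates such a blind spot. This is a simplified Python model of estimating a quantile from cumulative bucket counts by linear interpolation within a bucket (the general idea behind `histogram_quantile` in Prometheus), not the actual Prometheus or Cortex code. The sample values and bucket bounds are hypothetical and are not the ones from the linked issue.

```python
import bisect


def bucket_counts(samples, upper_bounds):
    """Return (sorted bounds, cumulative counts of samples <= each bound)."""
    bounds = sorted(upper_bounds)
    counts = [0] * len(bounds)
    for value in samples:
        i = bisect.bisect_left(bounds, value)  # first bound >= value
        if i < len(bounds):  # samples above the last bound are ignored here
            counts[i] += 1
    cumulative, total = [], 0
    for c in counts:
        total += c
        cumulative.append(total)
    return bounds, cumulative


def estimate_quantile(q, bounds, cumulative):
    """Interpolate linearly inside the bucket that contains rank q * N."""
    rank = q * cumulative[-1]
    for i, cum in enumerate(cumulative):
        if cum >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0
            return lower + (bounds[i] - lower) * (rank - below) / (cum - below)


# Hypothetical durations in seconds: 100 values spread over 1.01 .. 2.00.
samples = [1 + k / 100 for k in range(1, 101)]

coarse = [0.5, 1.0, 10.0]                      # one broad upper bucket
fine = [0.5, 1.0, 1.25, 1.5, 1.75, 2.0, 10.0]  # resolution where the data is

coarse_p99 = estimate_quantile(0.99, *bucket_counts(samples, coarse))
fine_p99 = estimate_quantile(0.99, *bucket_counts(samples, fine))
print(f"coarse buckets: p99 ~ {coarse_p99:.2f} s")
print(f"fine buckets:   p99 ~ {fine_p99:.2f} s")
```

With the broad 1 s to 10 s bucket, all samples land in one bucket and the interpolation spreads them across the whole range, so the p99 estimate ends up far above the largest observed value of 2 s. With bucket bounds placed where the data actually lives, the estimate is close to the true value. Choosing those bounds, however, requires already knowing the distribution.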

I feel like in our industry, when people do the readout and perform some basic dashboarding, they're often OK with using aggregates that have not been sanity-checked. That might sometimes be a fair approach. After all, if a readout is not useful, one will learn about that very fact through incidents :).

> Veneur[2] addresses these use-cases for applications that use DogStatsD[3] by using clever data structures for approximate histograms[4] and approximate sets[5]

This is extremely interesting, thanks for sharing. We'll have a look!

Having performed many different kinds of data analyses in my science career, I'm rather convinced of the idea that there should be an easy way to perform advanced (yet standard) data analyses for irregular distributions in multi-dimensional data, such as clustering (hierarchical, k-means, DBSCAN, you name it), Principal Component Analysis (PCA), and the like. I have ideas in this area -- let's see what makes sense and how far we get with Opstrace!

Again, thanks for the links and for sharing your perspective.