| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by vlovich123 2812 days ago

Mathematics is not something handed down by the gods. It's possible to encounter not just completely new problems, but also limitations to existing methods for solving a known problem.

In this particular case the challenge is aggregating statistics from a very large fleet & having automated alarms. Visualization tools don't help with any of that. More specifically, the reporting tools out there apparently have a very common & persistent flaw of reporting an average of percentiles across agents which is a statistically meaningless metric. It makes no difference how you visualize it - the data is bunk.

This article flips it so that agents simply report how many requests they got & how many exceeded the required threshold. This lets them report the percentage of users having a worse experience than the desired SLA. You can also build reliable tools on top of this metric. It's not a universal solution but it's a neat trick to maintain the performance properties of not needing to pull full logs from all agents & still have a meaningful representation of the latency of your users.