| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bostik 351 days ago

> The distribution is heavily lopsided with a long tail.

You'll see this in any properly active online system. Back in the previous job we had to drill it to teams that mean() was never an acceptable latency measurement. For that reason the telemetry agent we used provided out-of-the-box p50 (median), p90, p95, p99 and max values for every timer measurement window.

The difference between p99 and max was an incredibly useful indicator of poor tail latency cases. After all, every one of those max figures was an occurrence of someone or something experiencing the long wait.

These days, if I had the pleasure of dealing with systems where individual nodes handled thousands of messages per second, I'd add p999 to the mix.