

by nefariousoctopi 2211 days ago

I wouldn't say that I am great at stats either, but AFAIK this is mostly because of very high tail latencies (usually measured at the 99th percentile, and commonly caused by garbage collection or cache misses) or because the latencies have a multimodal distribution [1] (e.g. when the monitored system has a fast path and a slow path for requests, so the latencies "group" around multiple points).

The average latency does not tell us much since it does not represent anything meaningful, i.e. it is not the latency of a typical user/request as you might expect, because the average is skewed by the extremely high tail latencies [2] or by the multiple modes (the peaks in the latency histogram). The typical latency is more likely to be better represented by the median latency (though not in the multimodal case, AFAIK).
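To make the skew concrete, here is a small sketch with made-up numbers (the sample, the 10 ms and 2000 ms values, and the `percentile` helper are all my own assumptions for illustration, not measurements from any real system). It uses a simple nearest-rank percentile; real monitoring tools may interpolate differently.

```python
import statistics

# Hypothetical sample: 98 fast requests at 10 ms and 2 slow ones at 2000 ms
# (e.g. requests that hit a GC pause). These numbers are invented.
latencies_ms = [10.0] * 98 + [2000.0] * 2


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    # Smallest 1-based rank such that at least p% of the samples are <= it,
    # computed with integer arithmetic (ceiling of p * n / 100).
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, median={median_ms} ms, p99={p99_ms} ms")
```

In this sample the mean lands well above what 98% of requests actually saw and well below what the slow 2% saw, so it describes nobody's experience, while the median and the 99th percentile each describe a real group of requests.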

As for why not just go with the median latency: you usually need to make multiple requests in parallel and end up waiting for the slowest request of the group. The 95th, 98th or 99th percentile is commonly used to cover for this (sorry, can't find a suitable reference). This is also preferable in the case of a multimodal distribution (well, at least for monitoring and/or general performance diagnosis purposes, since you usually care about the worst/tail cases).
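A rough way to see the fan-out effect, assuming the parallel requests are independent and identically distributed (my simplifying assumption, real backends are often correlated): if a single request stays under some percentile threshold with probability q, then the chance that at least one of n parallel requests exceeds it is 1 - q^n. The function name and the fan-out sizes below are hypothetical.

```python
def prob_any_slower(quantile, fanout):
    """Probability that at least one of `fanout` independent requests
    exceeds the latency at the given quantile (e.g. 0.5 for the median)."""
    return 1.0 - quantile ** fanout


for fanout in (1, 10, 100):
    over_median = prob_any_slower(0.50, fanout)
    over_p99 = prob_any_slower(0.99, fanout)
    print(f"fanout={fanout:3d}  "
          f"P(slowest > median)={over_median:.4f}  "
          f"P(slowest > p99)={over_p99:.4f}")
```

Under these assumptions, with a fan-out of 10 the slowest request is almost certainly slower than the median, so the median says very little about how long the whole group takes, whereas a high percentile is still a usable bound for small groups.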

[1]: https://en.wikipedia.org/wiki/Multimodal_distribution

[2]: https://en.wikipedia.org/wiki/Skewness#/media/File:Relations...