by karmakaze 2544 days ago
Yes, in this case the max of P99 latencies across all servers is useful. It will be high if any machine's P99 is high, indicating that something is not normal. If it's a known ongoing condition, I would make two metrics, one that includes the trouble server in the aggregate and one that doesn't, and alert on the latter to catch new, unknown problems.

Okay, I see what you mean. It is a statistic, and insofar as it is what it is described as, it's true. And it has some use.

The problem is that what people are usually asking for is the 99th percentile of response time for user requests. They get the above statistic not because they want it, but because they are limited by their monitoring systems. They accept it as a proxy for what they want, when it is actually something completely different and may have completely different values.
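To make the gap concrete, here is a rough sketch with made-up numbers. The server names, the latencies, and the nearest-rank `percentile` helper are all my own assumptions for illustration, not anything a particular monitoring system does.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p (assumes values is non-empty)."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, to avoid floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds.
latencies = {
    "web-1": [10] * 9900 + [50] * 100,  # busy server, mild tail
    "web-2": [10] * 50 + [900] * 50,    # nearly idle server, bad tail
}

# Statistic 1: max of the per-server P99s.
per_server_p99 = {name: percentile(v, 99) for name, v in latencies.items()}
max_of_p99 = max(per_server_p99.values())

# Statistic 2: P99 over all user requests pooled together.
pooled = [x for v in latencies.values() for x in v]
pooled_p99 = percentile(pooled, 99)

print(per_server_p99)  # {'web-1': 10, 'web-2': 900}
print(max_of_p99)      # 900
print(pooled_p99)      # 50
```

With this data the max of P99s is 900 ms while the P99 that users actually experience is 50 ms, because the slow server handles under 1% of the requests.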

It would be more useful to calculate the distribution for overall response time, and also calculate the distribution for each individual machine; you could then look for anomalies (in your statistics) that could identify which machines are performing badly.
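A minimal sketch of that idea, assuming you can get at the raw per-request latencies. The `slow_machines` function, the "more than `factor` times the overall P99" rule, and the sample data are hypothetical choices of mine; a real anomaly check would likely be more careful.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p (assumes values is non-empty)."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def slow_machines(latencies_by_machine, p=99, factor=2):
    """Return (overall percentile, machines whose own percentile exceeds factor * overall)."""
    pooled = [x for v in latencies_by_machine.values() for x in v]
    overall = percentile(pooled, p)
    flagged = sorted(
        name
        for name, v in latencies_by_machine.items()
        if v and percentile(v, p) > factor * overall
    )
    return overall, flagged

# Hypothetical latencies in milliseconds.
samples = {
    "web-1": [10] * 990 + [40] * 10,
    "web-2": [10] * 990 + [45] * 10,
    "web-3": [10] * 90 + [400] * 10,  # low traffic, heavy tail
}

overall, flagged = slow_machines(samples)
print(overall)  # 40
print(flagged)  # ['web-3']
```

The overall number answers the question people are actually asking, and the per-machine numbers are only used to point at where to look.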