Hacker News
by Fiadliel 2535 days ago
How about this: you have a load balancer that sends much less traffic to servers that are running slowly (but still a small amount of traffic, so that it recognises when the machine recovers). One backend server has a hardware problem; requests that usually take milliseconds now take seconds; the 99th percentile on that machine is much higher than anywhere else. But it's actually serving a tiny tiny fraction of traffic. Is the value useful in indicating the performance of the service as a whole?

Yes, in this case the max of the P99 latencies across all servers is useful. It will be high if any machine's P99 is high, indicating that something is not normal. If it's a known ongoing condition, I would make two metrics, one that includes the trouble server in the aggregate and one that doesn't, and alert on the second one to catch new, unknown problems.
Okay, I see what you mean. It is a statistic, and insofar as it measures what it says it measures, it's true. And it has some use.

The problem is that what people are usually asking for is the 99th percentile of response time for user requests. They get the above statistic instead, not because they want it, but because they are limited by their monitoring systems. They accept it as a proxy for what they want, when it is actually something completely different and may have completely different values.

It would be more useful to calculate the distribution for overall response time, and also calculate the distribution for each individual machine; you could then look for anomalies (in your statistics) that could identify which machines are performing badly.
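
To make the difference concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the server names, the latency samples, the nearest-rank `percentile` helper, and the `OUTLIER_FACTOR` threshold are all assumptions, not anything taken from a real monitoring system. It pools the raw samples to get the P99 that users actually see, computes the per-server P99s separately, and flags servers whose P99 is far above the typical one.

```python
import math
from statistics import median

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latency samples in milliseconds. The load balancer has
# already cut traffic to the degraded server (web-4) down to a trickle.
latencies_by_server = {
    "web-1": [10] * 980 + [40] * 20,
    "web-2": [10] * 980 + [40] * 20,
    "web-3": [10] * 980 + [40] * 20,
    "web-4": [2000] * 5,
}

# P99 per machine, and the "max of P99s" aggregate discussed above.
p99_by_server = {name: percentile(s, 99) for name, s in latencies_by_server.items()}
max_of_p99 = max(p99_by_server.values())

# P99 over all user requests: pool the raw samples first, then take the percentile.
all_samples = [x for s in latencies_by_server.values() for x in s]
overall_p99 = percentile(all_samples, 99)

# Assumed rule of thumb: flag a server whose P99 is well above the median P99.
OUTLIER_FACTOR = 3
typical_p99 = median(p99_by_server.values())
outliers = [name for name, v in p99_by_server.items() if v > OUTLIER_FACTOR * typical_p99]

print("max of per-server P99:", max_of_p99)
print("overall P99:", overall_p99)
print("outlier servers:", outliers)
```

With these invented numbers the two statistics disagree by a wide margin: the pooled P99 stays at the healthy value because the bad machine handles only 5 of 3005 requests, while the max of per-server P99s is dominated by that one machine. The per-server comparison is what points at `web-4`. Note that pooling requires raw samples or mergeable histograms; per-server percentiles cannot be averaged or combined after the fact to recover the overall one.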