by ksec 1756 days ago
OK. I am stupid. I don't understand the article.

>> If you must use latency to measure efficiency, use mean (avg) latency. Yes, average latency

What is wrong with measuring latency at the 99.99th percentile, with a clear guideline that optimising efficiency (in this article, higher utilisation) should not come at the cost of latency?

Because latency is part of user experience. And UX comes first before anything else.

Or does it imply that there are a lot of people who don't know the trade-off between latency and utilisation? Because I don't know anyone who has utilisation to 1 or even 0.5 in production.

3 comments

The other two answers you got are good. I will say that monitoring p99 (or 99.9 or whatever) is a good thing, especially if you're building human-interactive stuff. Here's my colleague Andrew Certain talking about how Amazon came to that conclusion: https://youtu.be/sKRdemSirDM?t=180

But p99 is just one summary statistic. Most importantly, it's a robust statistic that rejects outliers. That's a very good thing in some cases. It's also a very bad thing if you care about throughput, because throughput is proportional to 1/latency, and if you reject the outliers then you'll overestimate throughput substantially.
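
To put rough numbers on that, here is a minimal sketch. It assumes a single worker handling requests one at a time, and the latency samples are made up for illustration. It compares the throughput the worker actually achieves with the estimate you get after dropping everything above p99.

```python
# Hypothetical latencies in seconds: 99 fast requests and one slow outlier.
latencies = [0.010] * 99 + [1.0]

def throughput(samples):
    """Requests per second for one worker processing the samples back to back."""
    return len(samples) / sum(samples)

ordered = sorted(latencies)
# Nearest-rank position of p99, computed as ceil(99 * n / 100) in integer math.
keep = (99 * len(ordered) + 99) // 100
trimmed = ordered[:keep]  # everything above p99 is rejected as an outlier

actual = throughput(latencies)    # equivalent to 1 / mean latency
optimistic = throughput(trimmed)  # estimate once the outlier is rejected

print(f"actual: {actual:.1f} req/s, outliers rejected: {optimistic:.1f} req/s")
```

With these made-up samples the trimmed estimate comes out at roughly twice the real throughput, because the one slow request accounts for about half of the total time.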

p99 is one tool. A great and useful one, but not for every purpose.

> Because I don't know anyone who has utilisation to 1 or even 0.5 in production.

Many real systems like to run much hotter than that. High utilization reduces costs, and reduces carbon footprint. Just running at low utilization is a reasonable solution for a lot of people in a lot of cases, but as margins get tighter and businesses get bigger, pushing on utilization can be really worthwhile.

In my previous job, I and the latency-sensitive engineering teams in general mostly went with just four core latency measurements.[ß]

    - p50, to see the baseline
    - p95, to see the most common latency peaks
    - p99, to see what the "normal" waiting times under load were
    - max, because that's what the most unfortunate customers experienced

In a normal distributed system the spread between p99 and max can be enormous, but the mindset of ensuring a smooth customer experience, with awareness that a real person had to wait that long, is exceptionally useful. You need just one slightly slower service for the worst-case latency to skyrocket. In particular, GraphQL is exceptionally bad at this without real discipline - the minimum request latency is dictated by the SLOWEST downstream service.
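
A rough sketch of that four-number summary, bucketed by route. The route names and samples are hypothetical, and nearest-rank is just one of several percentile definitions (a metrics backend may interpolate instead). The last helper shows the fan-out point: a request that waits on several parallel downstream calls finishes with the slowest one.

```python
def percentile(samples, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[max(rank, 1) - 1]

def summarize(samples):
    """The four core measurements: baseline, common peaks, tail, worst case."""
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }

def fan_out_latency(downstream_latencies):
    """Latency of a request that waits for all parallel downstream calls."""
    return max(downstream_latencies)

# Hypothetical latency samples in milliseconds, bucketed by request route.
samples_by_route = {
    "/bets": [12, 15, 11, 14, 13, 250, 12, 16, 13, 12],
    "/odds": [5, 6, 5, 7, 5, 6, 5, 5, 6, 90],
}

for route, samples in samples_by_route.items():
    print(route, summarize(samples))

# Three fast services and one slow one: the slow one sets the request latency.
print(fan_out_latency([8, 11, 9, 400]))
```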

To be fair, it was a real-time gambling operation. And we were operating within the first Nielsen threshold (roughly 0.1 s, the limit for a response to feel instantaneous).

ß: bucketing these by request route was quite useful.

EDIT: formatting

Percentiles are order statistics: they are robust and not sensitive to outliers. This is why they are sometimes very useful. It is also why they do not capture how big the data above the percentile are (the remaining 0.01% in the case of p99.99).

Let's take the median, which is also an order statistic, and a sequence of latency measurements: 0.005 s, 0.010 s, 3600 s. The median latency is 0.010 s, and this number does not tell you how bad latency can actually be. The mean latency is 1200.005 s, which is more indicative of how bad the worst case is.
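
For anyone who wants to check the arithmetic, here is a quick sketch that uses Python's standard `statistics` module on the same three measurements:

```python
import statistics

# The three latency measurements from the example, in seconds.
latencies = [0.005, 0.010, 3600.0]

median_latency = statistics.median(latencies)  # the middle value
mean_latency = statistics.mean(latencies)      # sum of the values divided by 3

print(f"median: {median_latency} s")
print(f"mean:   {mean_latency:.3f} s")
```

The median stays at 0.010 s however large the third value gets, while the mean, about 1200.005 s here, moves with it.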

In other words, percentiles show how often a problem happens (or does not happen). Mean values show the impact of the problem.

every statistic is a summary and a lie. p50/p99 metrics are good in the sense that they tell you a number someone actually experienced and they put an upper bound on that experience. they are bad because they won’t tell you how the experience below that bound looks.

mean and all its variants won’t show you a number that someone in your system necessarily experienced, but it will incorporate the entire distribution and show when it has changed.

in the context of efficiency, mean is beneficial because it can be used to measure concurrency in a system via Little’s law, and will signal changes in your concurrency that a percentile metric won’t necessarily show.
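
a small sketch of the Little's law point, with made-up numbers: L = λW, where L is the average number of requests in flight, λ is the arrival rate and W is the mean latency. the arrival rate and both sample sets below are assumptions for illustration. the bulk of requests gets three times slower while the tail stays put, so p99 does not move but the mean, and with it the concurrency, does.

```python
def mean(samples):
    return sum(samples) / len(samples)

def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(99 * n / 100)
    return ordered[rank - 1]

def concurrency(arrival_rate, samples):
    """Little's law: L = lambda * W, with W taken as the mean latency."""
    return arrival_rate * mean(samples)

ARRIVAL_RATE = 100  # requests per second, assumed constant (hypothetical)

# Hypothetical latencies in seconds. The fast path slows from 10 ms to 30 ms,
# while the slowest 2% of requests take 1 s in both cases.
before = [0.010] * 98 + [1.0] * 2
after = [0.030] * 98 + [1.0] * 2

for name, samples in (("before", before), ("after", after)):
    in_flight = concurrency(ARRIVAL_RATE, samples)
    print(f"{name}: p99={p99(samples)} s, mean={mean(samples):.4f} s, "
          f"in flight={in_flight:.2f}")
```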