Hacker News
by mjb 1755 days ago
The other two answers you got are good. I will say that monitoring p99 (or 99.9 or whatever) is a good thing, especially if you're building human-interactive stuff. Here's my colleague Andrew Certain talking about how Amazon came to that conclusion: https://youtu.be/sKRdemSirDM?t=180

But p99 is just one summary statistic. Most importantly, it's a robust statistic that rejects outliers. That's a very good thing in some cases. It's also a very bad thing if you care about throughput, because throughput is proportional to 1/latency, and if you reject the outliers then you'll overestimate throughput substantially.
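To make that concrete, here is a rough sketch with made-up numbers (990 fast requests and 10 slow ones, not data from any real system). It assumes a single worker handling requests back to back, so throughput is about 1 / mean latency, and it uses a simple nearest-rank percentile:

```python
def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of a sorted list; pct is an integer in 1..100."""
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceiling of pct% of n
    return sorted_values[rank - 1]

# Hypothetical sample, in seconds: 990 fast requests plus 10 slow outliers.
latencies = sorted([0.010] * 990 + [2.0] * 10)

p99 = nearest_rank(latencies, 99)

# Mean over everything, and mean after rejecting everything above p99.
mean_all = sum(latencies) / len(latencies)
trimmed = [x for x in latencies if x <= p99]
mean_trimmed = sum(trimmed) / len(trimmed)

# Assumption: one worker serving requests serially, so throughput ~ 1 / mean.
throughput_all = 1 / mean_all
throughput_trimmed = 1 / mean_trimmed

print(f"p99 = {p99 * 1000:.0f} ms")
print(f"throughput with outliers:    {throughput_all:.1f} req/s")
print(f"throughput without outliers: {throughput_trimmed:.1f} req/s")
```

In this toy sample p99 is still 10 ms, yet the ten outliers drag the estimated throughput from about 100 req/s down to about 33 req/s. Planning capacity from the outlier-free view would be off by roughly 3x.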

p99 is one tool. A great and useful one, but not for every purpose.

> Because I dont know anyone who has utilisation to 1 or even 0.5 in production.

Many real systems like to run much hotter than that. High utilization reduces costs, and reduces carbon footprint. Just running at low utilization is a reasonable solution for a lot of people in a lot of cases, but as margins get tighter and businesses get bigger, pushing on utilization can be really worthwhile.

1 comment

At my previous job, my team and the latency-sensitive engineering teams in general mostly tracked just four core latency measurements.[ß]

    - p50, to see the baseline
    - p95, to see the most common latency peaks
    - p99, to see what the "normal" waiting times under load were
    - max, because that's what the most unfortunate customers experienced

In a normal distributed system the spread between p99 and max can be enormous, but the mindset of ensuring a smooth customer experience, with the awareness that a real person had to wait that long, is exceptionally useful. You need just one slightly slower service for the worst-case latency to skyrocket. In particular, GraphQL is exceptionally bad at this without real discipline: the minimum request latency is dictated by the SLOWEST downstream service.
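A back-of-the-envelope sketch of why fan-out hurts, under two assumptions of mine: the request waits on all downstream calls in parallel, and the calls are independent. The function names and the 1% "slow" probability are illustrative only:

```python
def request_latency(downstream_latencies):
    """A request waiting on parallel calls finishes when the slowest one does."""
    return max(downstream_latencies)

def prob_any_slow(n_services, p_slow=0.01):
    """Chance that at least one of n independent calls lands above its own p99."""
    return 1 - (1 - p_slow) ** n_services

# Hypothetical latencies in seconds for three downstream services.
print(request_latency([0.012, 0.015, 0.480]))

for n in (1, 10, 100):
    print(n, round(prob_any_slow(n), 3))
```

With one downstream call, 1% of requests see a p99-or-worse call. With 10 it is already about 9.6%, and with 100 about 63%, so the tail of each individual service becomes the common case for the request as a whole.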

To be fair, it was a real-time gambling operation. And we were operating within the first Nielsen threshold (about 0.1 s, the limit for a response to feel instantaneous).

ß: bucketing these by request route was quite useful.
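For illustration, a minimal sketch of what bucketing those four numbers by route could look like. The route names, the sample values, and the nearest-rank percentile are my assumptions here, not what that team actually ran:

```python
from collections import defaultdict

def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of a sorted list; pct is an integer in 1..100."""
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceiling of pct% of n
    return sorted_values[rank - 1]

def summarize_by_route(samples):
    """samples: iterable of (route, latency_ms). Returns per-route p50/p95/p99/max."""
    buckets = defaultdict(list)
    for route, latency_ms in samples:
        buckets[route].append(latency_ms)

    summary = {}
    for route, values in buckets.items():
        values.sort()
        summary[route] = {
            "p50": nearest_rank(values, 50),
            "p95": nearest_rank(values, 95),
            "p99": nearest_rank(values, 99),
            "max": values[-1],
        }
    return summary

# Hypothetical samples: (route, latency in ms).
samples = [("/bet", ms) for ms in range(1, 101)]
samples += [("/balance", ms) for ms in (5, 5, 6, 7, 250)]

summary = summarize_by_route(samples)
for route, stats in summary.items():
    print(route, stats)
```

Note how the low-traffic route has p95, p99, and max all collapsing onto the single 250 ms sample. With few requests per bucket the high percentiles are effectively just the max, which is one more reason to keep max on the dashboard.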

EDIT: formatting