by BrendanLong
284 days ago
A big part of this is that CPU utilization metrics are frequently averaged over a long period of time (like a minute), but if your SLO is 100 ms, what you care about is whether there's any ~100 ms period where CPU utilization is at 100%. Measuring p99 (or even p100) CPU utilization can make this a lot more visible.
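To make that concrete, here's a rough sketch with made-up numbers: one minute of CPU utilization sampled in 100 ms windows, where a handful of windows are pegged at 100%. The sample data, the `percentile` helper, and the nearest-rank definition are my assumptions for illustration, not how any particular monitoring tool computes things.

```python
import math

def mean(samples):
    """Plain arithmetic mean, which is what a one-minute average reports."""
    return sum(samples) / len(samples)

def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100 (an assumed definition)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical minute: 600 windows of 100 ms each.
# 588 windows sit at 30% CPU, 12 windows are saturated at 100%.
samples = [0.30] * 588 + [1.00] * 12

avg = mean(samples)
p95 = percentile(samples, 95)
p99 = percentile(samples, 99)
p100 = percentile(samples, 100)

print(f"avg={avg:.3f} p95={p95:.2f} p99={p99:.2f} p100={p100:.2f}")
```

With this toy data the minute average lands around 31%, and p95 still reads 30%, but p99 and p100 both read 100%. Only 2% of the windows are saturated, which is enough to blow a 100 ms SLO and not enough to move the average.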
|
The biggest problem was not variance in request rate; it was variance in request cost, which is usually where queuing kicks in, unless you're being dumb about things. I think for a lot of apps p98 is probably a better metric to chase. p99 and p100 are useful for understanding your application better, but I'm not sure you want your bosses to fixate on them.
But our contracts were for p95, which was fortunate given the workload, or at least whoever made the contracts got good advice from the engineering team.
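Here's a toy illustration of the cost-variance point, under heavy assumptions: a single FIFO worker, perfectly even arrivals every 10 ms, and two invented workloads with the same 8 ms mean cost. None of these numbers come from the system described above; `latencies` and `percentile` are hypothetical helpers.

```python
import math

def latencies(arrival_gap_ms, costs_ms):
    """Per-request latency (queue wait + service) on one FIFO worker.

    Requests arrive at a fixed interval; only their cost varies.
    """
    free_at = 0  # time at which the worker finishes its current request
    result = []
    for i, cost in enumerate(costs_ms):
        arrival = i * arrival_gap_ms
        start = max(arrival, free_at)  # wait if the worker is still busy
        free_at = start + cost
        result.append(free_at - arrival)
    return result

def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100 (an assumed definition)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Same arrival rate and same mean cost (8 ms) in both workloads.
uniform_costs = [8] * 1000               # every request costs 8 ms
mixed_costs = ([4] * 19 + [84]) * 50     # 1 in 20 requests costs 84 ms

uniform_lat = latencies(10, uniform_costs)
mixed_lat = latencies(10, mixed_costs)

for name, lat in (("uniform", uniform_lat), ("mixed", mixed_lat)):
    print(name, "p95 =", percentile(lat, 95), "ms,", "p99 =", percentile(lat, 99), "ms")
```

In this sketch the uniform workload never queues, so every percentile is 8 ms. The mixed workload has identical request rate and average cost, yet each expensive request backs up the ones behind it, and p95 and p99 come out at 78 ms and 84 ms.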