by pclmulqdq
709 days ago
While this is correct in a literal sense, you are missing something about what a profiler tells you about the p99 (and p99.9) tails of the end-to-end system: sources of "slowness" are often correlated in these requests. Some systems I have worked on have p99 times that are built out of a combination of 90th percentile events that you would find in a profiler. In that case, the profiler doesn't show anything as being particularly bad. Profilers also say nothing about queueing, and can very much mislead you if you care about latency specifically. If your "slowness" is driven by a single function (or is made up of truly uncorrelated events), you can accurately measure your tails with a profile. If not, a trace will give you meaningfully more information.
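To make the correlation point concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration, not a model of any real system: two stages with exponentially distributed latencies, and a hypothetical `shared_weight` knob that mixes a common per-request factor (standing in for something like a busy host or a backed-up queue) into both stages. It counts how often both stages land above their own p90 in the same request.

```python
import random


def joint_slow_fraction(shared_weight, n=200_000, seed=0):
    """Fraction of simulated requests where both stages exceed their own p90.

    shared_weight = 0 gives independent stages; larger values mix a common
    per-request factor into both stages (an illustrative assumption).
    """
    rng = random.Random(seed)
    a, b = [], []
    for _ in range(n):
        shared = rng.expovariate(1.0)  # per-request common cause
        a.append(rng.expovariate(1.0) + shared_weight * shared)
        b.append(rng.expovariate(1.0) + shared_weight * shared)

    # Per-stage p90 thresholds, which is roughly what a per-function view sees.
    p90_a = sorted(a)[int(0.9 * n)]
    p90_b = sorted(b)[int(0.9 * n)]

    both = sum(1 for x, y in zip(a, b) if x > p90_a and y > p90_b)
    return both / n


independent = joint_slow_fraction(0.0)
correlated = joint_slow_fraction(1.0)
print(f"independent stages: {independent:.4f}")
print(f"correlated stages:  {correlated:.4f}")
```

With independent stages the joint fraction should sit near 0.1 × 0.1 = 1%. With the shared factor mixed in, it comes out several times higher, even though each stage's own distribution still just shows an ordinary p90.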
|
Your slowness is always a function of the underlying building blocks, their performance distributions, and their bottlenecks. And sure, two 90th percentiles can make for a 99th percentile. A profiler won't magically convey what sequence of operations a request is doing under the hood.
I agree that having visibility into the requests via tracing can help zoom in on the problem. But so can having metrics on the underlying systems, e.g. if you have a queue in your system, you could look at the performance of that queue.
I'll admit that most of my experience is in tuning for maximal throughput rather than for a given percentile. As a rule of thumb, systems with high throughput usually yield a much flatter distribution of latencies at a given workload. I also tend to think about my "budget" in the various parts of the system to get to my desired performance characteristics. That is a luxury you don't have on "legacy" systems whose behaviour you need to troubleshoot, and there tracing also lets you get a "cut" through the stack that shows you what's going on.