by mhawthorne
79 days ago
Author here. That was a great article, thanks for sharing. Especially the part about how your probability of experiencing a p99 latency is much higher than you'd intuit.

I don't agree with all of it, but a few points, made directly or indirectly, definitely hit home, such as:
- there is no single metric that can accurately represent "latency"
- most of our metrics are misleading in what they unconsciously include or exclude

I can remember once looking at a graph of requests/second and wishing I could see a distribution of requests per millisecond within an individual second. That level of detail is hard to come by, so in the meantime, we do what we can with the data we have.
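To put a rough number on the p99 point: a minimal sketch, assuming each request independently has a 1% chance of landing above the p99 (real traffic is usually correlated, so treat this as a back-of-the-envelope figure). The function name is made up for illustration.

```python
def prob_at_least_one_slow(n_requests: int, percentile: float = 0.99) -> float:
    """Chance that at least one of n independent requests exceeds the given percentile."""
    # P(all n requests are at or below the percentile) = percentile ** n
    return 1.0 - percentile ** n_requests


# A page load or session that fans out into many requests
for n in (1, 10, 100, 500):
    print(f"{n:>4} requests -> {prob_at_least_one_slow(n):.1%} chance of hitting p99 at least once")
```

Under that independence assumption, a user who makes on the order of a hundred requests is more likely than not to see at least one p99-or-worse response.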
Would you rather see "number of requests started at this ms" (you seem to suggest this), or is something else more interesting?
I think a sort of Gantt chart that plots the duration of each request as well as its starting time within the time span (e.g. a second or more) might be very informative. Each individual request sits at a different position on the Y axis, with time on the X axis. If you have some bound on requests in flight, that could be the height of the Y axis, so you can easily see calm or busy periods.
Our observability stack, at least, doesn't show this level of detail, but it would be very interesting to have it. (We do have calculated heatmaps based on maximum request time in Grafana, which is at least better than plots of average request times.)
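For what it's worth, both views can be sketched from a raw request log. This is only an illustration, assuming you can export (start_ms, duration_ms) pairs as integers; the sample data and function names are invented, and the lane assignment is a simple greedy scheme, not something any particular observability tool does.

```python
import heapq
from collections import Counter


def starts_per_ms(requests):
    """Count how many requests started in each millisecond."""
    return Counter(start for start, _ in requests)


def assign_lanes(requests):
    """Give each request the lowest free row (Y position) for a Gantt-style plot.

    Returns (rows, lanes_used), where rows holds (start, duration, lane) tuples
    and lanes_used equals the peak number of requests in flight.
    """
    free = []       # min-heap of lane indices that are free again
    busy = []       # min-heap of (end_ms, lane) for requests still in flight
    next_lane = 0
    rows = []
    for start, duration in sorted(requests):
        # Release lanes whose request finished at or before this start time
        while busy and busy[0][0] <= start:
            _, lane = heapq.heappop(busy)
            heapq.heappush(free, lane)
        if free:
            lane = heapq.heappop(free)
        else:
            lane = next_lane
            next_lane += 1
        heapq.heappush(busy, (start + duration, lane))
        rows.append((start, duration, lane))
    return rows, next_lane


def render(rows, lanes, width_ms, ms_per_col=5):
    """Draw one text line per lane; '#' marks columns where a request is in flight."""
    grid = [[" "] * (width_ms // ms_per_col) for _ in range(lanes)]
    for start, duration, lane in rows:
        first = start // ms_per_col
        last = max(first + 1, -(-(start + duration) // ms_per_col))  # ceiling division
        for col in range(first, min(last, len(grid[lane]))):
            grid[lane][col] = "#"
    return ["".join(line) for line in grid]


# Hypothetical sample: (start_ms, duration_ms) within a 60 ms window
requests = [(0, 30), (5, 10), (5, 40), (20, 5), (50, 10)]
rows, peak = assign_lanes(requests)
print("starts per ms:", dict(starts_per_ms(requests)))
print("peak in flight:", peak)
for line in render(rows, peak, width_ms=60):
    print("|" + line + "|")
```

Because a finished request's lane gets reused, the number of lanes is the peak concurrency, so a configured in-flight bound would work directly as the Y-axis height.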