by mrngm 78 days ago
That reminds me of the talk "How NOT to Measure Latency"[0] by Gil Tene at the Strange Loop conference in 2015 (or read this blog post[1], which covers the most important points).

[0] https://www.youtube.com/watch?v=lJ8ydIuPFeU

[1] https://bravenewgeek.com/everything-you-know-about-latency-i...


Author here. That was a great article, thanks for sharing. Especially the part about how your probability of experiencing a p99 latency is much higher than you'd intuit.
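The arithmetic behind that p99 point can be sketched in a few lines. This assumes a user session issues N requests whose latencies are independent of each other, which is a simplification rather than something real traffic guarantees; the function name is made up for illustration.

```python
def prob_at_least_one_slow(n_requests: int, percentile: float = 0.99) -> float:
    """Chance that at least one of n independent requests is slower
    than the given percentile (0.99 means 'worse than p99')."""
    # Each request avoids the slow tail with probability `percentile`,
    # so all n avoid it with probability percentile ** n.
    return 1.0 - percentile ** n_requests


if __name__ == "__main__":
    for n in (1, 10, 100, 200):
        print(n, round(prob_at_least_one_slow(n), 3))
```

Under that independence assumption, a session of 100 requests sees at least one p99-or-worse response about 63% of the time, and a session of 200 requests about 87% of the time.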

I don't agree with all of it, but definitely a few points made directly or indirectly hit home, such as:

- there is no single metric that can accurately represent "latency"

- most of our metrics are misleading in what they unconsciously include or exclude

I remember once looking at a graph of requests/second and wishing I could see the distribution of requests per millisecond within an individual second. That level of detail is hard to come by, so in the meantime, we do what we can with the data we have.

If you have individual request logs with timing information, you could construct that afterwards. It does take some effort to find an effective way of displaying these metrics. Where would you put an individual request that took 532ms and started at t=34.682s? Would you align all requests that started in the 34th second at t=34s, or look at completion time (i.e. within t=35s)?

Would you rather see "number of requests started at this ms" (you seem to suggest this), or is something else more interesting?

I think a sort of Gantt chart that plots duration of requests as well as starting time within the time span (e.g. a second or more) might be very informative. Each individual request on a different position on the Y axis, time on the X axis. Perhaps you have some bound on requests in flight, that could be the height of the Y axis, so you can easily see calm or busy periods.
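A rough sketch of how the Y positions for such a chart could be computed from request logs. It assumes each log entry gives a start time and a duration in seconds, and it reuses the lowest free row once a request finishes, so the number of rows equals the peak number of requests in flight. `assign_lanes` and the sample values are hypothetical, not taken from any real observability tool.

```python
import heapq


def assign_lanes(requests):
    """requests: iterable of (start_s, duration_s).
    Returns a list of (start_s, end_s, lane), sorted by start time, where
    `lane` is the row on the Y axis the request would be drawn on."""
    free_lanes = []   # min-heap of lane numbers that are currently unused
    in_flight = []    # min-heap of (end_s, lane) for running requests
    next_lane = 0
    placed = []
    for start, duration in sorted(requests):
        # Release the lanes of every request that finished before this one starts.
        while in_flight and in_flight[0][0] <= start:
            _, lane = heapq.heappop(in_flight)
            heapq.heappush(free_lanes, lane)
        if free_lanes:
            lane = heapq.heappop(free_lanes)
        else:
            lane = next_lane
            next_lane += 1
        end = start + duration
        heapq.heappush(in_flight, (end, lane))
        placed.append((start, end, lane))
    return placed


if __name__ == "__main__":
    sample = [(34.682, 0.532), (34.700, 0.010), (34.705, 0.100), (34.720, 0.020)]
    for start, end, lane in assign_lanes(sample):
        print(f"lane {lane}: {start:.3f}s -> {end:.3f}s")
```

Each tuple can then be drawn as one horizontal bar (for instance with a plotting library's horizontal bar function), with time on the X axis and the lane on the Y axis. Busy periods show up as tall stacks of bars, calm periods as one or two rows.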

Our observability stack, at least, doesn't show this level of detail, but it would be very interesting to have it. (We do have calculated heatmaps based on maximum request time in Grafana, which is at least better than plots of average request times.)

Good question. Whether to log the millisecond when the request starts or when it ends is a great example of how complex these things are to think about accurately, let alone capture.

I'd want to log when the requests start, as I'm mostly concerned with how well-distributed request arrival was at that level of granularity.

I wondered if the network layers in between my client and server were effectively "smoothing" request arrival across each second, or if instead requests were very bursty so that a per-minute spike in a typical graph was dominated by a few seconds or milliseconds within that minute.
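A minimal sketch of how that smooth-versus-bursty question could be checked from logs, assuming request start times are available as integer milliseconds since some epoch. Both function names and the peak-to-mean ratio as a burstiness measure are illustrative choices, not an established metric from the article or the talk.

```python
def arrivals_per_ms(start_times_ms, second):
    """Count request starts in each of the 1000 milliseconds of one second.
    `start_times_ms` holds integer millisecond timestamps; `second` is the
    whole second to inspect (e.g. 34 for t=34.000s up to t=34.999s)."""
    counts = [0] * 1000
    for t in start_times_ms:
        sec, ms = divmod(t, 1000)
        if sec == second:
            counts[ms] += 1
    return counts


def peak_to_mean(counts):
    """Ratio of the busiest bucket to the average bucket.
    Close to 1 means arrivals were spread evenly; large values mean bursts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return max(counts) / (total / len(counts))


if __name__ == "__main__":
    starts = [34682, 34682, 34683, 34900, 35001]
    counts = arrivals_per_ms(starts, 34)
    print(counts[682], counts[683], peak_to_mean(counts))
```

The same bucketing works at coarser levels (seconds within a minute), which would show whether a per-minute spike is dominated by just a few seconds.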