Why Average Latency Is a Terrible Way To Track Website Performance (mvolo.com)
40 points by mvolo 4876 days ago
8 comments

TL;DR: Average anything is a terrible way to track anything. (And median or mode are bad, too). Any single-scalar value that compresses information that is best expressed as a graph (or multiple graphs!) is immensely lossy to the point where arguably it obfuscates more than it makes clear.

Back when we had to live with printing-press methods of displaying information (i.e., where anything that wasn't pure text was very difficult to display), mean/median/mode numbers were a necessary evil. But if you're looking at a computer screen, there's really no reason to subject yourself to an abstraction that throws out 90% of your data.

This was one of the more interesting realizations when I was an undergraduate writing my first research paper. We were testing latency of MIDI interfaces, and after sanity checking by looking at some of the underlying data, realized that average, or even average+stddev, was obscuring a lot of stuff. For example, note-to-note consistency is a major issue in music interfaces, often more important than absolute latency, since the spacing between notes is very important to melody perception (games often have a similar issue).

Showing the full histogram isn't a full solution either, though. Not only does using the average latency obscure the issue by boiling it down to a single scalar, but the full histogram of latencies also loses the information on note-to-note consistency! That's because a latency histogram loses sequencing information, so it doesn't distinguish between the case where you had a lot of 20ms latencies in a row followed by a lot of 50ms latencies in a row, and the case where every other message oscillated between 20ms and 50ms latencies (much worse). You can try to capture some of that information by making a histogram of adjacent-latency deltas, as one attempt. Or you can capture a different view on it by plotting latency vs. time and looking for spikes (but that can obscure less-obvious trends, and is unwieldy as a data representation if you're trying to summarize a system's behavior over a period of hours).
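To make the adjacent-delta idea concrete, here is a minimal Python sketch. The two traces are made-up values chosen to mirror the 20ms/50ms example above, and `adjacent_deltas` is a hypothetical helper name, not something from the paper.

```python
from collections import Counter


def adjacent_deltas(latencies_ms):
    """Absolute change in latency between each message and the next one."""
    return [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]


# Two synthetic traces (made-up values) with identical latency histograms.
blocked = [20] * 5 + [50] * 5   # a run of 20ms, then a run of 50ms
alternating = [20, 50] * 5      # flips between 20ms and 50ms on every message

# The plain latency histograms cannot tell the two traces apart...
same_histogram = Counter(blocked) == Counter(alternating)

# ...but the histograms of adjacent deltas can.
blocked_deltas = Counter(adjacent_deltas(blocked))
alternating_deltas = Counter(adjacent_deltas(alternating))

print(same_histogram)
print(blocked_deltas)
print(alternating_deltas)
```

The blocked trace has a single 30ms jump and is otherwise steady, while the alternating trace jumps 30ms on every message, which is the difference the plain histogram hides.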

The paper is here, though the actual numbers are 9 years old at this point, so probably not that useful: http://www.cs.hmc.edu/~bthom/res/midi_timing/publications/IC...

> Average anything is a terrible way to track anything.

Came here to say exactly this. And averages are especially insidious when used for data that doesn't have a symmetric distribution, like most latencies.
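A quick way to see the asymmetry problem, as a rough sketch in Python with made-up numbers (95 fast requests and 5 very slow ones):

```python
import statistics

# Synthetic latencies in ms (made-up values): mostly fast, with a slow right tail.
latencies_ms = [100] * 95 + [5000] * 5

mean_ms = statistics.mean(latencies_ms)      # dragged upward by the 5 slow requests
median_ms = statistics.median(latencies_ms)  # what the typical request looks like

# How many requests were actually slower than the "average"?
slower_than_mean = sum(1 for x in latencies_ms if x > mean_ms)

print(mean_ms, median_ms, slower_than_mean)
```

Here the mean lands at a latency that no request actually experienced: 95% of requests were more than three times faster than it, and the other 5% were more than ten times slower.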

hi Steve,

Author here. I think most people on HN would echo your sentiment about averages wholesale ... But I wanted to go a little deeper into selecting a better alternative for operational monitoring.

It's easy to say "averages are bad" but harder to say "use X instead", and explain why. We tried. Do you think we did it?

Well the title seems a bit childish (since obviously everybody on HN knows it's a terrible idea). Why don't you change the post title to more appropriately reflect what you were trying to propose as an alternative?
Additional standard statistics like mode, median, quartiles etc are really useful.

And you can always throw things into gnuplot to get a quick, exploratory look at things. It will at least give you sense of whether you're looking at a normal distribution, something skewed, multi-modal distributions etc etc.

Hi, author here.

I am in complete agreement. Unfortunately, a lot of monitoring and APM tools still lead with average response time as one of the toplevel metrics. And a lot of people still make incorrect assumptions based on it.

Although, the percentile on average latency is not great either. I try to make the case for using a metric that counts acceptable experiences vs. their latency value, e.g. the Apdex index or our derived sat score.

Best, Mike
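For readers who haven't met Apdex: the sketch below uses the standard Apdex formula (satisfied requests count fully, tolerating requests count half). It is not the derived sat score mentioned above, whose exact definition isn't given in this thread. The sample latencies and the 0.5s threshold are made up.

```python
def apdex(latencies_s, t):
    """Standard Apdex score for a target threshold t, in seconds.

    Satisfied:  latency <= t
    Tolerating: t < latency <= 4 * t
    Frustrated: everything slower (contributes nothing to the score)
    """
    if not latencies_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for x in latencies_s if x <= t)
    tolerating = sum(1 for x in latencies_s if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies_s)


# Made-up response times in seconds.
samples = [0.3, 0.4, 0.5, 1.2, 1.9, 6.0, 0.2, 0.45, 3.0, 0.1]
score = apdex(samples, t=0.5)
print(score)
```

With these samples, 6 requests are satisfied, 2 are tolerating and 2 are frustrated, so the score is (6 + 2/2) / 10 = 0.7.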

I think The Tech Report has my favorite benchmark graphs. They sort the data points by latency so you can intuitively see the distribution of your samples. e.g. http://techreport.com/review/24022/does-the-radeon-hd-7950-s...

I almost completely agree with you. I often tell people that statistics is the study of compressing information in useful ways. That said, scalar statistics can be very useful if the compression is 'correct'. For example, if you have an a priori reason to believe a distribution will be gaussian (a very common situation, and an assumption that basically allowed statistics to grow to where it is today), mean and variance will fully describe the distribution. Many other common distributions can be fully described by a small number of parameters.

Michael Abrash talked about this in his black book of graphics programming.

When he was writing Quake, they could trade off between lightning-fast graphics (40fps+, on a 486) 99% of the time with the occasional horrible slowdown to less than 5fps, vs. a steady frame rate that never changed much but wasn't terribly fast.

Turns out people notice the occasional horrible lag much more than when things perform uniformly.

When tuning a performance critical service, focus on the outliers.

I think you should not ignore either. By default, think about 99%ile and 50%ile when tuning and optimizing. Depending on the context (e.g. games), even 99%ile might not be enough, or you might want to think about 99%ile of what? Frames? Scenes? Seconds of gameplay?
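As a rough illustration of looking at both numbers, here is a small Python sketch using the nearest-rank definition of a percentile (one of several common definitions, so other tools may report slightly different values). The latencies are made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies in ms (made-up values): 98 fast requests, 2 slow ones.
latencies_ms = [100] * 98 + [2000] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(p50, p99, mean_ms)
```

The 50th percentile describes the typical request (100ms), the 99th exposes the tail (2000ms), and the mean (138ms) describes neither.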

Also, back to the topic of the article at hand, I hope that their "T" is not really two seconds. That is already way too slow for most web purposes.

One problem I have with this approach is that it requires you to pick a threshold after which the response is "too slow." This number can change a lot over the course of an application lifetime, and would be hard to pick objectively anyways.

Median latency -- perhaps with (the smoothing-effect of) a rolling median -- would be more robust to outliers without having to resort to hardcoding of "too slow" thresholds. It would still require the human to connect the dots (e.g. median latency of >200 is "too slow") but it's an improvement on mere average response time for reasons noted.
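A rolling median is only a few lines. This is a minimal sketch, assuming a fixed-size window over the most recent samples; the window size and latencies are made up.

```python
import statistics
from collections import deque


def rolling_median(samples, window):
    """Median of the most recent `window` samples at each point (fewer at the start)."""
    recent = deque(maxlen=window)  # oldest sample falls out automatically
    medians = []
    for x in samples:
        recent.append(x)
        medians.append(statistics.median(recent))
    return medians


# Synthetic latencies in ms (made-up values) with one large outlier.
latencies_ms = [100, 110, 105, 3000, 108, 102, 107]
smoothed = rolling_median(latencies_ms, window=5)
print(smoothed)
```

The 3000ms outlier barely moves the rolling median, which stays between 100 and 108 throughout, whereas a rolling mean over the same window would jump to several hundred.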

This actually calls for advances in analytics packages. If you could specify the threshold on a per page (or regular expression for complex requests), and have the system track and notify you of threshold exceptions, this wouldn't be too difficult to manage with some nice sliders.

Sounds like a good problem for an analytics startup to tackle.
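The per-page threshold idea could look something like this sketch. Everything here is hypothetical: the rule list, the URL patterns, the threshold values and the function names are invented for illustration and don't correspond to any existing analytics package.

```python
import re

# Hypothetical rules: the first matching pattern wins. All values are made up.
THRESHOLD_RULES = [
    (re.compile(r"^/api/search"), 2.0),  # complex requests get more headroom
    (re.compile(r"^/static/"), 0.2),     # static assets should be fast
    (re.compile(r".*"), 1.0),            # default for everything else
]


def threshold_for(path):
    """Return the 'too slow' threshold in seconds for a request path."""
    for pattern, limit_s in THRESHOLD_RULES:
        if pattern.search(path):
            return limit_s
    raise LookupError(f"no threshold rule matches {path!r}")


def threshold_exceptions(requests):
    """Return the (path, latency_s) pairs that exceeded their threshold."""
    return [(path, s) for path, s in requests if s > threshold_for(path)]


# Made-up request log: (path, latency in seconds).
requests = [
    ("/api/search?q=x", 1.5),
    ("/static/app.js", 0.5),
    ("/home", 0.8),
    ("/home", 1.4),
]
slow = threshold_exceptions(requests)
print(slow)
```

The 1.5s search request passes because its rule allows 2s, while the 0.5s static file is flagged because its rule allows only 0.2s.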

I agree with this. It's hard to gauge what is acceptable because it really depends on the application. There are so many other dependencies when dealing with latency and how it affects performance.

hi, author here.

Unfortunately, you HAVE TO do it. If you do not set a threshold for what is acceptable, how do you determine whether or not you are providing an acceptable experience to your users?

No amount of aggregate metrics can help you answer this question unless you know what's acceptable, and what isn't - for each important set of URLs in your app.

I agree that it's "hard" to do. In our own product (https://www.leansentry.com), we solve this problem by grouping urls, and using good defaults / making it easy to override the thresholds for users.

There are so many moving parts that this could either be a great tool to analyze data or it could open up a can of worms which could lead down the road to network re-architecture. If this takes into account best effort or SLA based ISP, Network topology, QoS, Packet Prioritization, etc then I think it could be a useful tool. Without that, it's just a tool that spits out pretty pictures. If your main selling point is data then it has to be more than just what latency can show.

hi there,

The post is about selecting a top level metric for monitoring website performance. Once a problem is indicated, you would definitely need to drill in to figure out what part of your app is affected, when, and what caused it.

LeanSentry (our own application monitoring product, https://www.leansentry.com) does this. However, describing this was outside the scope of my post (but you can see the demo of it on the website).

I think there's something to be said about keeping some key metrics super simple so that "everybody" can understand without having to refer to a formula or arbitrarily set thresholds. I've been using 99 and 90 percentile avg performance. It captures enough information in most cases and doesn't require any explanation.

Hi edouard,

I completely agree! Keeping toplevel metrics SIMPLE is the key. Of course, simple but also not misleading you into any wrong beliefs.

While we liked the 95th percentile approach, we decided against it. It's still too focused on the actual response time itself, which we thought was less relevant than the number of users experiencing bad performance.

I think for us the bottom line was:

A) If you are having site-wide performance issues, the 95th percentile is a good metric.

B) However, if you have more isolated issues (we find this happens more often to more mature sites), satisfaction score is better.

Best, Mike
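To illustrate points A and B, here is a small Python sketch comparing a healthy site with one where an isolated issue affects 3% of requests. The numbers are made up, the 2s threshold is only an illustrative value, and `satisfied_share` is a simplified stand-in (share of requests at or under the threshold), not the actual satisfaction score formula, which isn't given in this thread.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


def satisfied_share(samples, threshold_s):
    """Simplified stand-in for a satisfaction score: fraction of requests within the threshold."""
    return sum(1 for x in samples if x <= threshold_s) / len(samples)


threshold_s = 2.0  # illustrative "too slow" threshold, in seconds

# Synthetic latencies in seconds (made-up values).
healthy = [0.5] * 1000
isolated_issue = [0.5] * 970 + [10.0] * 30  # 3% of requests hit one slow URL

print(p95(healthy), p95(isolated_issue))
print(satisfied_share(healthy, threshold_s), satisfied_share(isolated_issue, threshold_s))
```

The 95th percentile is 0.5s in both cases, so it doesn't register the isolated problem at all, while the satisfied share drops from 100% to 97%.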

I'm seeing a lot of "averages are bad" etc but I think you come closest to what I had in mind: there isn't anything inherently wrong with using simple metrics. The caveat is you just need to keep in mind and understand their limitations and where they fall down. I think a lot of people using the 99th or 95th percentile and whatnot understand that, but just failed to lay the reasoning out.

Posted this comment on the article, but thought it would be useful here as well:

Good points on why average latency is a bad metric, and while the idea behind Apdex was good, it never ended up being the right measure. The Apdex score still depends on a HiPPO (Highest Paid Person's Opinion) to determine what T should be, and this can change over time.

At SOASTA (and previously at LogNormal), we borrowed the concept of LD50 (the median lethal dose) from biology. The LD50 value has the property of adapting to what your audience thinks rather than what your HiPPO thinks is a good experience.

We described the method at the Velocity conferences (Santa Clara and London) last year, and wrote it up in a blog post here: http://www.lognormal.com/blog/2012/10/03/the-3.5s-dash-for-a...

Hope you find it interesting.

I should also mention that it's useful to apply some kind of smoothing to timeseries data (like latency over time). Holt-Winters double-exponential smoothing is particularly good at this. What it does is smooth out temporary glitches and show you when things turn unexpectedly bad. If you've ever received a page and said, "Oh yeah, that one... that goes away in 3 seconds. Happens every day.", then you'll find this useful. H-W D-E smoothing only shows you the ones that don't go away after 3 seconds.
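For anyone who wants to try it, below is a minimal sketch of double-exponential smoothing in the form usually called Holt's linear method (a level plus a trend, with no seasonal component, which is what full Holt-Winters adds). The `alpha` and `beta` values and the series are made up, and the initialization (first value as level, first difference as trend) is just one common choice.

```python
def double_exponential_smoothing(series, alpha, beta):
    """Holt's linear method: return the smoothed level at each point.

    alpha controls how fast the level follows new data,
    beta controls how fast the trend estimate is updated.
    """
    if len(series) < 2:
        raise ValueError("need at least two points")
    level = series[0]
    trend = series[1] - series[0]
    smoothed = [level]
    for x in series[1:]:
        previous_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        smoothed.append(level)
    return smoothed


# Synthetic latencies in ms (made-up values).
glitch = [100] * 5 + [1000] + [100] * 5   # one-sample spike that goes away
sustained = [100] * 5 + [1000] * 10       # latency that turns bad and stays bad

glitch_smoothed = double_exponential_smoothing(glitch, alpha=0.2, beta=0.1)
sustained_smoothed = double_exponential_smoothing(sustained, alpha=0.2, beta=0.1)

print(max(glitch_smoothed))
print(sustained_smoothed[-1])
```

With these settings the one-sample spike is damped well below its raw value, while the sustained shift keeps pulling the smoothed series upward, so an alert on the smoothed value would fire for the second case and not the first.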

I thought looking at the 99th (or other) percentile was pretty standard practice?

Depends on standard - within the clued-in performance community, yes, but there are major, major companies still pushing averages and that causes a lot of people, particularly those without much stats / engineering background, to expect it everywhere.

To use one example which is prevalent throughout marketing, advertising, etc. Google Analytics reports only averages – this makes the results unreliable enough that I'm now advising people to simply pretend that field does not exist as it's completely untrustworthy. Awhile back I blogged about an example where 3 samples out of 200K threw the average off by a full order of magnitude: http://chris.improbable.org/2012/05/18/google-analytics-dece...

Very interesting, thank you. I especially like the replies from the Google analytics team, 8 months apart, that both acknowledge the issue and say they'll fix it...

Also that in addition to not having fixed it, accurate stats are apparently less of a priority than, say, a gigantic fixed-position toolbar. That's been disappointing…

hi Jabbles,

Author here. The 99th or 95th percentile is a much better metric! We also make the case for the industry standard Apdex or our derived metric, sat score. These are increasingly used by APM tools like ours or New Relic.

Unfortunately, many existing tools and people who use them still look at latency aggregates and often make incorrect assumptions.

Searched the page for "standard deviation". Didn't find it. Hit the back button.

Standard deviation isn't the problem, skew is. Yes, skew will increase the standard deviation, but the heart of the issue here is how fat the right tail of the distribution is.

Standard deviation is often a useful metric, but it's at least as flawed as mean in skewed distributions because it doesn't treat either direction around the (already flawed) mean any differently.
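A tiny Python sketch of that last point, with made-up numbers: two data sets that are mirror images of each other have exactly the same standard deviation, and only a skewness measure tells you which side the tail is on. The `skewness` helper here is the plain population skewness (third standardized moment), written out by hand.

```python
import math
import statistics


def skewness(samples):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.mean(samples)
    std = statistics.pstdev(samples)
    return sum((x - mean) ** 3 for x in samples) / len(samples) / std ** 3


# Made-up values: mirror images of each other.
right_tail = [1, 1, 1, 1, 10]     # mostly small, one large outlier
left_tail = [10, 10, 10, 10, 1]   # mostly large, one small outlier

same_spread = math.isclose(statistics.pstdev(right_tail), statistics.pstdev(left_tail))

print(same_spread)
print(skewness(right_tail), skewness(left_tail))
```

Both sets have a standard deviation of 3.6, but the skewness is about +1.5 for the first and about -1.5 for the second, so the standard deviation alone cannot say whether the outliers are slow requests or fast ones.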

hi nateabele,

Author here. Did you also search for stdev, st.dev, variance? Just kidding.

The post is not about averages. It's about selecting the right metric to track website performance. Standard deviation would surely qualify the avg. latency a bit, but it would still be a pretty lame alternative to using a better toplevel metric like Apdex.

Best, Mike