
by jkire 1893 days ago

> A common pattern in these systems is that there's some frontend, which could be a service or some Javascript or an app, which calls a number of backend services to do what it needs to do.

I think an important idea here is that you should be trying to measure the experience of a user (or as close to it as possible). If there is a slow service somewhere in your stack but it has no impact on user experience, then who cares? Conversely, if users are complaining that the app feels sluggish, then it doesn't matter if all your graphs say that everything is OK.

I find it helpful to split up graphs/monitoring into two categories: 1) if these graphs look fine then the service is probably fine, and 2) if problems are being reported then these graphs might give an insight into why things are going wibbly. In general, we alert on the former and diagnose with the latter. Of course, it's nigh on impossible to get perfect metrics that track actual user experience, but we've definitely found it worthwhile to try to get as close to it as possible.

---

Another fun problem with summary statistics is that they can easily "lie" if the API can do a variable amount of work. For example, if you have a "get updates" API that is called regularly to fetch updates since the last call, then you end up with two "modes": 1) a small amount of time between calls, so the call is super fast, and 2) a large amount of time between calls, so the call is slow. Now, in any given time period the vast majority of the calls are going to be super quick, but every user will hit the slow case the first time they open the app that day. This results in summary statistics that all but ignore those slow API calls when opening the app.
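
To make the two "modes" concrete, here is a rough sketch with made-up numbers. The 20 ms and 2000 ms latencies, the 50 users, and the `first_of_day` flag are all assumptions for illustration, not measurements from a real system. It shows the median and p90 over all calls sitting entirely in the fast mode even though every user hit the slow case once.

```python
from statistics import median

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct (1-100)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

# Hypothetical latencies: one slow "first open of the day" call per user,
# followed by 19 fast polls shortly after each other.
FAST_MS, SLOW_MS = 20, 2000
calls = []
for user in range(50):
    calls.append({"user": user, "first_of_day": True, "ms": SLOW_MS})
    calls.extend(
        {"user": user, "first_of_day": False, "ms": FAST_MS} for _ in range(19)
    )

all_ms = [c["ms"] for c in calls]
first_open_ms = [c["ms"] for c in calls if c["first_of_day"]]

summary = {
    "median_all": median(all_ms),                # dominated by the fast polls
    "p90_all": percentile(all_ms, 90),           # still in the fast mode
    "median_first_open": median(first_open_ms),  # what users feel on open
    "users_hitting_slow": len({c["user"] for c in calls if c["ms"] == SLOW_MS}),
}
print(summary)
```

Tagging the calls by something like "first call of the day" (or by time since the last call) and summarising each group separately is one way to stop the fast polls from drowning out the slow opens.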

1 comment

jeffbee 1893 days ago

The flip side is that errors often pollute service latency statistics. If your service is capable of serving a fast failure, for example by returning 503 instantly for all requests when it is overloaded, you need another dimension in your statistics to handle that.
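
A small sketch of what that extra dimension might look like, using invented samples (40 successful requests at 120 ms and 60 instant 503s at 1 ms; neither figure comes from a real service). Grouping latencies by status code keeps the fast failures from making the overall median look healthy.

```python
from collections import defaultdict
from statistics import median

# Hypothetical samples during an overload: (HTTP status, latency in ms)
requests = [(200, 120)] * 40 + [(503, 1)] * 60

# Keep latency samples in separate buckets per status code
by_status = defaultdict(list)
for status, ms in requests:
    by_status[status].append(ms)

# One number over everything: pulled down by the instant failures
overall_median = median(ms for _, ms in requests)

# The same data with status as an extra dimension
per_status_median = {status: median(samples) for status, samples in by_status.items()}
error_rate = len(by_status[503]) / len(requests)

print(overall_median, per_status_median, error_rate)
```

Here the overall median looks excellent precisely because most requests are failing, so the latency graph only means something when read next to the error rate or split by status.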
