by wting 2827 days ago
Time based SLOs definitely have their limitations, but in this instance isn't it fairly easy to redefine the SLO in terms of requests rather than time?

This is one of the recommendations given in the Google SRE book: use request-level metrics for SLOs/SLIs where possible. As your systems grow larger, the probability of a total outage, which would be measured in time, becomes a smaller fraction of the probability of a partial outage.

Since total outages are a special case of partial outages, use metrics that cleanly measure partial outages. Those are request error metrics.
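
To make the difference concrete, here is a rough sketch comparing the two ways of counting. The `Minute` record, the function names, and the rule that a minute only counts as "down" when every request fails are all assumptions made for illustration, not anything taken from the SRE book.

```python
from dataclasses import dataclass


@dataclass
class Minute:
    total: int   # requests received in this minute
    failed: int  # requests that returned an error


def time_based_availability(minutes, down_threshold=1.0):
    """Share of minutes not counted as down.

    Assumption: a minute is 'down' only when its error ratio reaches
    down_threshold (1.0 means every request failed).
    """
    up = sum(
        1 for m in minutes
        if m.total == 0 or m.failed / m.total < down_threshold
    )
    return up / len(minutes)


def request_based_availability(minutes):
    """Share of all requests that succeeded, regardless of when they arrived."""
    total = sum(m.total for m in minutes)
    failed = sum(m.failed for m in minutes)
    return 1.0 if total == 0 else (total - failed) / total


# Eight healthy minutes, one partial outage (30% errors), one total outage.
minutes = [Minute(1000, 0)] * 8 + [Minute(1000, 300), Minute(1000, 1000)]

time_sli = time_based_availability(minutes)        # 0.9: partial outage is invisible
request_sli = request_based_availability(minutes)  # 0.87: partial outage is counted
```

The time-based number treats the 30% minute as fully up, while the request-based number charges it for exactly the requests that failed.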

I wish it were that easy - our teams have their targets for p99 and p995 ratios, but those cannot capture the overall user experience. For us it's not just the ratio of failed requests, but closer to a four-tuple of:

  * maximum number of users affected
  * maximum time of unavailability
  * maximum observed latency
  * highest ratio of failed requests over a sequence of relatively tight measurement windows

Those are demanding constraints, but such is reality when peak trading activity can take place within just a few minutes. If users cannot place their trades during those short windows, they will quickly lose confidence and take their business elsewhere.
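
A minimal sketch of how such a four-tuple might be evaluated over tight windows, purely as an illustration. The `Window` fields, the 10-second window length, the `down_ratio` cutoff for calling a window "unavailable", and the limit values are all hypothetical; a real system would pick its own definitions.

```python
from dataclasses import dataclass


@dataclass
class Window:
    users_affected: int    # distinct users with at least one failed request
    total: int             # requests in this window
    failed: int            # failed requests in this window
    max_latency_ms: float  # slowest request seen in this window


def worst_case(windows, window_seconds, down_ratio=0.5):
    """Return (max users affected, longest unavailability in seconds,
    max latency in ms, worst per-window failure ratio).

    Assumption: a window is 'unavailable' when its failure ratio is at
    least down_ratio; unavailability is the longest consecutive run.
    """
    worst_users, worst_latency, worst_ratio = 0, 0.0, 0.0
    longest_down = current_down = 0
    for w in windows:
        ratio = w.failed / w.total if w.total else 0.0
        worst_users = max(worst_users, w.users_affected)
        worst_latency = max(worst_latency, w.max_latency_ms)
        worst_ratio = max(worst_ratio, ratio)
        current_down = current_down + 1 if ratio >= down_ratio else 0
        longest_down = max(longest_down, current_down)
    return (worst_users, longest_down * window_seconds, worst_latency, worst_ratio)


def meets_slo(observed, limits):
    """Every component of the tuple must stay within its limit."""
    return all(o <= l for o, l in zip(observed, limits))


windows = [
    Window(0, 500, 0, 40.0),
    Window(120, 500, 300, 900.0),
    Window(80, 500, 250, 700.0),
    Window(0, 500, 0, 35.0),
]
observed = worst_case(windows, window_seconds=10)
limits = (100, 30, 1000.0, 0.25)  # hypothetical targets
ok = meets_slo(observed, limits)
```

Here the aggregate failure ratio over all four windows is 550/2000 = 27.5%, yet the worst single window hits 60% and 120 users are affected at once, which is the part an overall request ratio would smooth over.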

So yes, request ratio is certainly a good part of the overall SLO but covers only a portion of the spectrum.