by bostik 2826 days ago
> what is my service level objective?

There are environments where flat time distribution for SLO calculation is not acceptable. (cough betting exchange)

If your traffic patterns are extremely spiky, with weekly peaks hitting 15-20x your base load, and a big chunk of your business comes from those peaks, then most normal calculations don't apply.

Let's say your main system that accepts writes is down for 10 minutes in a month. That's easily good for >99.9% uptime, but if the failure plus PR hit from an inconveniently timed 10-minute window can account for nearly 10% of your monthly revenue, that's a major business problem.
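To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes a 30-day month and reads the 10% figure as the share of monthly revenue tied to that window; the function and variable names are hypothetical.

```python
# Hypothetical figures: a 30-day month, plus the 10-minute / 10% numbers above.
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes

def time_based_availability(down_minutes, total_minutes=MINUTES_PER_MONTH):
    """Fraction of wall-clock time the service was up."""
    return 1 - down_minutes / total_minutes

down_minutes = 10
revenue_share_of_window = 0.10     # assumed share of monthly revenue at stake

availability = time_based_availability(down_minutes)
time_share = down_minutes / MINUTES_PER_MONTH

# How much heavier the window weighs in revenue than in wall-clock time.
revenue_weight = revenue_share_of_window / time_share

print(f"time-based availability: {availability:.4%}")
print(f"window share of the month: {time_share:.4%}")
print(f"revenue weight vs. flat time: {revenue_weight:.0f}x")
```

Under these assumptions the month still shows roughly 99.977% uptime, while the lost window counts about 432 times more in revenue than a flat time distribution would suggest.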

So SLOs should be set according to business needs. I may be a heretic for saying this, but not all downtime is equal.

1 comments

Time based SLOs definitely have their limitations, but in this instance isn't it fairly easy to redefine the SLO in terms of requests rather than time?
This is one of the recommendations given in the Google SRE book: use request-level metrics for SLOs/SLIs where possible. As your systems grow larger, the probability of total outage, which would be measured in time, becomes a smaller fraction of the probability of partial outage.

Since total outages are a special case of partial outages, use metrics that cleanly measure partial outages. That's request error metrics.
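As an illustration of how the two measures diverge under spiky load, here is a small sketch comparing time-based and request-based availability for a single 10-minute outage that lands inside a peak. Every traffic figure (base load, peak length, a 30-day month) is invented for the example.

```python
# All traffic figures are made up for illustration.
MINUTES_PER_MONTH = 30 * 24 * 60
BASE_RPM = 1_000                 # assumed base load, requests per minute
PEAK_MULTIPLIER = 20             # peaks at 20x base load
PEAK_MINUTES = 4 * 60            # assumed: four weekly peaks of one hour each
OUTAGE_MINUTES = 10              # the whole outage falls inside a peak

total_requests = (
    (MINUTES_PER_MONTH - PEAK_MINUTES) * BASE_RPM
    + PEAK_MINUTES * BASE_RPM * PEAK_MULTIPLIER
)
# Every request during the outage is counted as failed.
failed_requests = OUTAGE_MINUTES * BASE_RPM * PEAK_MULTIPLIER

time_availability = 1 - OUTAGE_MINUTES / MINUTES_PER_MONTH
request_availability = 1 - failed_requests / total_requests

print(f"time-based:    {time_availability:.3%}")
print(f"request-based: {request_availability:.3%}")
```

With these made-up numbers the time-based figure stays near 99.98%, while the request-based figure drops to about 99.58%, so a 99.9% request-level SLO would register the outage that the time-level one hides.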

I wish it were that easy. Our teams have their targets for p99 and p995 ratios, but those cannot capture the overall user experience. For us it's not just the ratio of failed requests, but closer to a four-tuple of:

  * maximum number of users affected
  * maximum time of unavailability
  * maximum observed latency
  * highest ratio of failed requests over a sequence of relatively tight measurement windows

Those are demanding constraints, but such is reality when peak trading activity can take place within just a few minutes. If users cannot place their trades during those short windows, they will quickly lose confidence and take their business elsewhere.
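The last item of that tuple, the worst failure ratio across tight windows, could be computed roughly as follows. This is only a sketch: it assumes fixed, non-overlapping windows and a request log of (timestamp, succeeded) pairs, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

def worst_window_failure_ratio(requests, window_seconds=60):
    """Return the highest failure ratio seen in any fixed-size window.

    requests: iterable of (unix_timestamp, succeeded) pairs.
    Windows with no requests are ignored.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for timestamp, succeeded in requests:
        bucket = int(timestamp // window_seconds)
        totals[bucket] += 1
        if not succeeded:
            failures[bucket] += 1
    if not totals:
        return 0.0
    return max(failures.get(bucket, 0) / count for bucket, count in totals.items())

# Made-up sample: one failure out of five requests overall.
sample = [(0, True), (10, True), (70, False), (80, True), (130, True)]

overall_ratio = sum(1 for _, ok in sample if not ok) / len(sample)
worst_ratio = worst_window_failure_ratio(sample, window_seconds=60)

print(overall_ratio, worst_ratio)
```

In the sample the overall failure ratio is 0.2, but the middle one-minute window fails half of its requests, which is the kind of short, sharp degradation a monthly ratio smooths over.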

So yes, request ratio is certainly a good part of the overall SLO but covers only a portion of the spectrum.