

by spondyl 1888 days ago

Ah, a comment where I can put on my SRE (Site Reliability Engineering) hat :)

You're completely right that 100% availability is unreasonable and, more often than not, isn't actually required, despite what a customer or site operator may believe.

Just a quick aside: availability (can an end user reach your thing) is often confused with uptime (is your thing up). If I operate a load balancer that your service sits behind and my load balancer dies, your service is up, but not available to those on the other side of said load balancer.

With that in mind, Hacker News could theoretically be up 100% of the time, but if I go through a tunnel while scrolling Hacker News on my mobile phone, then from my perspective as a user it is no longer 100% available: it is 100% minus the fraction of time I was without signal.
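To put rough numbers on the tunnel example, here's a minimal sketch. The function name and the session and tunnel durations are made up purely for illustration.

```python
def perceived_availability(window_seconds: float, unreachable_seconds: float) -> float:
    """Fraction of the window during which this one user could reach the service."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= unreachable_seconds <= window_seconds:
        raise ValueError("unreachable_seconds must fall within the window")
    return 1 - unreachable_seconds / window_seconds


# Hypothetical numbers: a one-hour scrolling session with a two-minute tunnel.
session = 60 * 60
tunnel = 2 * 60
print(f"{perceived_availability(session, tunnel):.2%}")  # prints 96.67%
```

The server never went down in that hour, yet this user saw well under "three nines".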

The point here is that a whole host of unreliable things happen in everyday life, from your router playing up to sharks biting the undersea cables.

Given all that, you then want to figure out a reasonable level of service to provide to your end users (ask for their input!), one that reflects reality.

It's worth noting too that Google (I don't love 'em but they pioneered the field) will actually intentionally disrupt services if they're "too available" so as to keep those downstream on their toes. 100% availability isn't actually good for anyone: the people depending on you start assuming you will never go down, and exercising their failure handling is just good practice, I suppose.

I can recommend reading the SLOs portion of the Google SRE book if you're curious to see more: https://sre.google/sre-book/service-level-objectives/

In short, an SLO is just an SLA without the legal part: a commitment to a certain level of service, often made internally from one team to another.

Ideally these objectives reflect the level of service your customers (internal or external) expect from your service.
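If it helps to see what a target means in practice, here's a rough sketch that turns an availability SLO into allowed downtime. The function name, the 30-day window and the example targets are my own picks, not something from the book.

```python
def error_budget_seconds(slo_target: float, window_seconds: float) -> float:
    """Downtime the SLO tolerates over the window, in seconds."""
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    return (1 - slo_target) * window_seconds


THIRTY_DAYS = 30 * 24 * 60 * 60  # assumed measurement window

# Example targets only; pick whatever your users actually need.
for target in (0.99, 0.999, 0.9999):
    minutes = error_budget_seconds(target, THIRTY_DAYS) / 60
    print(f"{target:.2%} target allows about {minutes:.1f} minutes down per 30 days")
```

A 99.9% target works out to roughly 43 minutes a month, which is a lot easier to reason about with your users than a bare percentage.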

> Chubby [Bur06] is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region.

> Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.

> The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system.

> In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.
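The quarterly check described in that quote could be sketched something like this. To be clear, this is my own guess at the logic, not Google's tooling: the function names, the 90-day quarter and the idea of treating leftover budget as an upper bound for a planned outage are all assumptions.

```python
def remaining_budget_seconds(slo_target: float, window_seconds: float,
                             downtime_seconds: float) -> float:
    """Unspent downtime allowance for the window, never negative."""
    budget = (1 - slo_target) * window_seconds
    return max(0.0, budget - downtime_seconds)


def should_synthesize_outage(slo_target: float, window_seconds: float,
                             downtime_seconds: float) -> bool:
    """True when real failures have not yet used up the budget this window."""
    return remaining_budget_seconds(slo_target, window_seconds, downtime_seconds) > 0


QUARTER = 90 * 24 * 60 * 60  # assumed: a 90-day quarter

# Hypothetical quarter with a 99.9% target and ten minutes of real downtime.
if should_synthesize_outage(0.999, QUARTER, downtime_seconds=600):
    left = remaining_budget_seconds(0.999, QUARTER, 600)
    print(f"Plan a controlled outage of at most {left / 60:.0f} minutes")
```

How much of the leftover budget you actually spend is a judgment call; the quote only says the service should meet, but not significantly exceed, its target.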