Hacker News new | ask | show | jobs
by how_gauche 2825 days ago
A rational way of explaining it is -- what is my service level objective? If my SLO is for two nines of uptime, then I can be down for 3.65 days a year, and I can get away with single-homing something, or going with a simple hot-failover replicated setup. I just need to be pretty confident that I can fix it if it breaks within a few hours.

If I need more nines of uptime than a single system could provide (and think about the MTBF numbers for the various parts in a computer, a single machine is not going to give you even three nines of uptime), then I'm literally forced to go distributed with N >= 3 and geographical redundancy (we stay up if any k of these machines fail, or if network to "us-east-4" goes down). Things get worse (and you need to be more paranoid) if your service level obligation turns into a service level agreement, because then it usually costs you money when you mess up.

Of course, the more distributed you are, the slower and more complicated your system becomes, and the more you expose yourself to the additional associated downtime risks ("oops, the lock service lost quorum"). They usually cost more to run, obviously. C'est la vie. There is no magic bullet in software engineering.

It used to be that you could spend 3X or more in engineering effort getting a distributed system up and running vs its singly-homed alternative. These days with cloud deployments (and Kubernetes for on-prem) you get a lot of labor-saving devices that make rolling out a distributed system a lot easier than it used to be. You still need to do a cost/benefit analysis!

3 comments

> what is my service level objective?

There are environments where flat time distribution for SLO calculation is not acceptable. (cough betting exchange)

If your traffic patterns are extremely spiky, such as weekly peaks hitting 15-20x of your base load, and where a big chunk of your business can come from those peaks, then most normal calculations don't apply.

Let's say your main system that accepts writes is 10 minutes down in a month. That's easily good for >99.9% uptime, but if a failure + PR hit from an inconveniently timed 10-minute window can be responsible for nearly 10% of your monthly revenue, that's a major business problem.

So when setting SLOs, they should be set according to business needs. I may be a heretic saying this but not all downtime is equal.

Time based SLOs definitely have their limitations, but in this instance isn't it fairly easy to redefine the SLO in terms of requests rather than time?
This is one of the recommendations given in the Google SRE book: use request-level metrics for SLOs/SLIs where possible. As your systems grow larger the probability of total outage, which would be measured in time, becomes a smaller fraction of the probability of partial outage.

Since total outages are a special case of partial outages, use metrics that cleanly measure partial outages. That's request error metrics.

I wish it was that easy - our teams have their targets for p99 and p995 ratios but they cannot capture the overall user experience. For us it's not just the ratio of failed requests, but closer to a four-tuple of:

  * maximum number of users affected
  * maximum time of unavailability
  * maximum observed latency 
  * highest ratio of failed requests over a sequence of relatively tight measurement windows
Those are demanding constraints, but such is reality when peak trading activity can take place within just a few minutes. If users can not place their trades during those short windows, they will quickly lose confidence and take their business elsewhere.

So yes, request ratio is certainly a good part of the overall SLO but covers only a portion of the spectrum.

> If I need more nines of uptime than a single system could provide (and think about the MTBF numbers for the various parts in a computer, a single machine is not going to give you even three nines of uptime).

Three nines is 8 hours and 45 minutes. In my experience with quality hardware, at a facility with onsite spares, that gives you two to three hardware faults, assuming your software and network is perfect, which I think is a fair assumption :)

Definitely the hardest problem when you switch to a distributed database is you have to contemplate failure. When your SPOF database is down, it's easy to handle, nothing works. When your distributed database partially fails, chances are you have a weird UX problem.

Thank you very much, especially that first paragraph. I think you've given me some good discussion tools, and I really appreciate that.