by sokoloff 1688 days ago
I was surprised by how hard it was to get a company to accept a target of “three nines five” (0.9995) at a time when it was growing rapidly and launching new physical and digital products continuously. I prevailed, but what I expected to be a five-minute conversation took a couple of 45-minute discussions (reducing the work uptime of the people in those discussions to 0.9993 for the year... :) )

Slowing your young company down in order to turn 0.9995 to 0.9998 is almost always a terrible trade. Even turning 0.995 to 0.999 is hard to justify in most places. (That improvement saves about 35 hours of downtime per year.)
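
As a quick check of the arithmetic above, here is a minimal sketch that converts an availability fraction into expected downtime per year. The function names are made up for illustration, and it assumes a 365-day year with downtime spread over all 8,760 hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at the given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def hours_saved(before: float, after: float) -> float:
    """Downtime avoided per year by moving from `before` to `after`."""
    return downtime_hours_per_year(before) - downtime_hours_per_year(after)


# The two improvements mentioned in the comment above.
print(round(hours_saved(0.995, 0.999), 2))
print(round(hours_saved(0.9995, 0.9998), 2))
```

Under these assumptions, going from 0.995 to 0.999 comes out at roughly 35 hours per year, while going from 0.9995 to 0.9998 is under 3 hours, which is the scale of the trade being argued about.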


Is there a rigorous framework to arrive at those targets? How do you know what you built has 0.9995 uptime, and not just 0.99?
By far the easiest way is to measure it after the fact, but I know that’s not what you’re asking... :)

We did do some "analysis", meaning that we made some underlying guesses and multiplied them together, but the real value is in getting people to think that 1.000 is not the actual goal-line, then tracking and doing RCA on all the outages, bucketing them into categories so you know whether to invest more in diverse networking, software testing, HA for DB servers, failover sites, zero downtime releases, etc.
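
A rough sketch of both halves of that process, under loud assumptions: "multiplying guesses together" is read here as multiplying per-component availabilities, which only holds if the components fail independently and every one of them is required. The component names, guessed values, and outage records below are all invented for illustration.

```python
from collections import defaultdict
from math import prod


def composite_availability(component_availabilities):
    """Availability of a chain of required, independent components."""
    return prod(component_availabilities)


def downtime_by_category(outages):
    """Sum outage minutes per RCA category, largest bucket first."""
    totals = defaultdict(float)
    for category, minutes in outages:
        totals[category] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical guesses for three components in series.
guesses = {"network": 0.9999, "database": 0.9998, "application": 0.9998}
print(composite_availability(guesses.values()))

# Hypothetical outage log: (RCA category, minutes of downtime).
outage_log = [("release", 40), ("network", 15), ("release", 25), ("database", 30)]
for category, minutes in downtime_by_category(outage_log):
    print(category, minutes)
```

The sorted buckets are the useful part: they show which category is eating the most of the downtime budget, and so where the next investment is most likely to pay off.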

Many times, you can avoid entire massive projects: “We need to be hosted in 2 geographically diverse data centers for availability.” “Uh, no we don’t; we have a budget of 262 minutes of downtime per year, and that project will save us less than 60 minutes per year on average, even under the best-case assumption that our own changes to implement it cause no downtime.”
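
The budget figure can be reproduced with a short sketch, assuming the 0.9995 target from earlier in the thread and a 365-day year (which gives about 262.8 minutes, in line with the 262 quoted). The function names are hypothetical, and the 60-minute saving is the upper bound from the comment, not a measured value.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600; assumes a non-leap year


def downtime_budget_minutes(target_availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return (1.0 - target_availability) * MINUTES_PER_YEAR


def share_of_budget(saved_minutes: float, target_availability: float) -> float:
    """Fraction of the yearly downtime budget that a project would win back."""
    return saved_minutes / downtime_budget_minutes(target_availability)


budget = downtime_budget_minutes(0.9995)
print(round(budget, 1))
# Upper bound from the comment: the project saves less than 60 minutes per year.
print(round(share_of_budget(60, 0.9995), 2))
```

On these numbers the second data center would recover less than a quarter of the yearly budget at best, which is the kind of comparison that lets a team turn the project down with a number rather than an opinion.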

If you're a large corporation, one way to get a good idea is to run lots and lots of fire drills around various disaster scenarios and time how long actual service restoration and re-routing take. For other companies, it's just guesswork.