
by sokoloff 1687 days ago

By far the easiest way is to measure it after the fact, but I know that’s not what you’re asking... :)

We did do some "analysis", meaning that we made some underlying guesses and multiplied them together, but the real value is in getting people to accept that 1.000 (i.e., 100% availability) is not the actual goal-line. After that, it's in tracking every outage, doing root-cause analysis (RCA) on each one, and bucketing them into categories so you know whether to invest more in diverse networking, software testing, HA for DB servers, failover sites, zero-downtime releases, etc.
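To make the bucketing step concrete, here's a rough sketch of the kind of tally I mean. The category names and minute counts are invented for illustration (they're not real outage data), and `downtime_by_category` is just a hypothetical helper name:

```python
from collections import defaultdict

def downtime_by_category(outages):
    """Sum downtime minutes per RCA category, largest bucket first.

    `outages` is an iterable of (category, minutes) pairs, one per outage.
    """
    totals = defaultdict(float)
    for category, minutes in outages:
        totals[category] += minutes
    # Sort descending so the biggest contributor to downtime comes first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Invented sample data, purely illustrative.
example_outages = [
    ("network", 40),
    ("software release", 55),
    ("database", 30),
    ("software release", 40),
    ("network", 12),
]

for category, minutes in downtime_by_category(example_outages):
    print(f"{category}: {minutes:.0f} min")
```

Whatever lands at the top of that list is where the next availability dollar probably should go; the categories near the bottom are the ones where a big project is hard to justify.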

Many times, this lets you avoid entire massive projects. (“We need to be hosted in 2 geographically diverse data centers for availability.” “Uh, no we don’t; we have a budget of 262 minutes of downtime per year, and that project will save us less than 60 minutes per year on average, using the best-case assumption that our own changes to implement it cause no downtime.”)
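For the arithmetic behind a budget like that, a minimal sketch follows. It assumes a 99.95% availability target, which is my back-calculation from the 262-minute figure rather than something stated above, and a 365-day year; the function names are hypothetical:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_budget_minutes(availability_target):
    """Minutes of downtime per year allowed by a target such as 0.9995."""
    return (1.0 - availability_target) * MINUTES_PER_YEAR

def share_of_budget(expected_savings_minutes, availability_target):
    """Fraction of the yearly downtime budget a project is expected to save."""
    return expected_savings_minutes / downtime_budget_minutes(availability_target)

target = 0.9995  # assumed target; roughly matches a 262 min/year budget
print(f"budget: {downtime_budget_minutes(target):.1f} min/year")
print(f"60 min saved is {share_of_budget(60, target):.0%} of the budget")
```

Under those assumptions the budget comes out to about 262.8 minutes, so a project that saves at most 60 minutes a year recovers less than a quarter of it, before counting any downtime the migration itself causes.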