by opportune
1164 days ago
I’m also skeptical any time someone mentions 5 nines uptime at a small scale. For one, it takes a lot of engineering to monitor and measure that with precision (that is, how do you know you are at 99.999% and not 99.995%?), and with so few people there may be holes in what’s monitored (there are so many places you can drop requests or otherwise lose availability). There are also tail risks like datacenter incidents (if your servers are on three racks in two data centers) or dependency failures like power outages that you may be getting lucky on avoiding due to small scale, rather than amortizing over a huge fleet. That is to say, if there is a 1% risk per year that one of your racks goes down and takes you down to 3 9s for that year, your long-run average is worse than the 5 9s you see in a typical year, but with only a few racks it doesn’t happen most years. That last one is, I suspect, what lets small-scale operators achieve “5 9s” with a fraction of the engineering of larger operators. You can get a lot of 9s most years because you dodge infrequent risks.
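To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: a 365-day year, a baseline of exactly 99.999% in a "good" year, a 1% chance per year of a rack incident, and 99.9% availability in a year where the incident happens. The function names are made up for this example.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability):
    """Downtime budget in minutes per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def expected_availability(baseline, p_incident, incident_availability):
    """Long-run average availability when a rare incident year is mixed in.

    baseline: availability in a year with no incident (assumed)
    p_incident: probability per year that the incident happens (assumed)
    incident_availability: availability in a year where it happens (assumed)
    """
    return (1.0 - p_incident) * baseline + p_incident * incident_availability


def nines(availability):
    """Number of nines, e.g. 0.999 is about 3.0."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    for a in (0.99999, 0.99995):
        print(f"{a:.5%} allows {downtime_minutes_per_year(a):.2f} min/year of downtime")

    long_run = expected_availability(0.99999, 0.01, 0.999)
    print(f"long-run average: {long_run:.5%} ({nines(long_run):.2f} nines)")
```

Under these assumptions, the gap between 99.999% and 99.995% is roughly 5 versus 26 minutes of downtime per year, which is why measuring it takes precise monitoring. The long-run average comes out around 99.998%, and it is quite sensitive to the assumed incident probability: at 10% per year it drops to just under four nines.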
We have that in place. We run phone systems that businesses depend on, and we have SLAs that guarantee this uptime in order to secure customers. We have network engineers dedicated to everything from guaranteeing it on the cloud side to checking Wireshark traces for any hint of abnormalities, every day. When I say 3 people, I mean those of us writing the front-end, back-end, database procs, and code that the open-source libraries require, including forking and custom patches. We have other team members ensuring our HA pairs, load balancing, redundancy, fail-overs, and all the other associated technology is working as expected.
I won't get into the details, but we have not violated our SLAs, ever.
And you'd be surprised at what open-source software we are using to drive parts of this system. Kudos to them, they are helping us maintain this with some rock-solid software.