HN Mirror



	by opportune 1164 days ago
	I’m also skeptical any time someone mentions 5 nines uptime at a small scale. For one, it takes a lot of engineering to actually monitor and detect that with precision (that is, how do you know you are at 99.999 and not 99.995?), and with so few people there may be holes in what’s monitored (there are so many places you can drop requests or otherwise lose availability). There are also tail risks like datacenter incidents (if your servers are on three racks in two data centers) or dependencies like power, where you may be getting lucky on avoiding outages due to small scale rather than amortizing them over a huge fleet. That is to say, if there is a 1% risk per year that one of your racks goes down and takes you down to 3 9s when that happens, you are really at slightly under 4 9s, but with only a few racks it doesn’t happen most years. That last one, I suspect, is what lets small-scale operators achieve “5 9s” with a fraction of the engineering of larger operators. You can get a lot of 9s most years because you dodge infrequent risks.
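The arithmetic behind this point can be sketched in a few lines of Python. This is a rough illustration, not the commenter's own calculation. The baseline of 99.999% in a normal year, the reading of "3 9s" as a year that ends at 99.9%, the single-rack simplification, and all function and constant names are assumptions made for the example.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def nines(availability: float) -> float:
    """Number of nines, e.g. 0.999 -> about 3."""
    return -math.log10(1.0 - availability)


def expected_availability(normal: float, incident: float, p_incident: float) -> float:
    """Long-run average availability when a rare incident year is mixed in."""
    return (1.0 - p_incident) * normal + p_incident * incident


# Assumed inputs (hypothetical, chosen to mirror the comment's wording).
NORMAL_YEAR = 0.99999    # five nines in a year with no incident
INCIDENT_YEAR = 0.999    # three nines in a year where a rack goes down
P_INCIDENT = 0.01        # 1% chance per year of that incident

long_run = expected_availability(NORMAL_YEAR, INCIDENT_YEAR, P_INCIDENT)

# Chance of seeing no incident at all over a five-year stretch.
p_clean_five_years = (1.0 - P_INCIDENT) ** 5

# The monitoring question: how far apart are 99.999 and 99.995 in minutes?
gap_minutes = downtime_minutes(0.99995) - downtime_minutes(0.99999)

print(f"long-run availability: {long_run:.7f} ({nines(long_run):.2f} nines)")
print(f"chance of five clean years: {p_clean_five_years:.3f}")
print(f"99.999 vs 99.995 differ by {gap_minutes:.1f} minutes per year")
```

Under these assumed inputs the long-run average lands between four and five nines, even though the large majority of individual years still report five. A different reading of the comment's figures, or more racks, would shift the exact number. The last line shows why the monitoring has to be precise: telling 99.999 from 99.995 means resolving roughly twenty minutes of downtime across a whole year.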


EMM_386 1164 days ago

> I’m also skeptical any time someone mentions 5 nines uptime at a small scale. For one, it takes a lot of engineering to be able to actually monitor and detect that with precision (that is, how do you know you are 99.999 and not 99.995?)

We have that in place. We run phone systems that businesses depend on, and we have SLAs that guarantee this uptime in order to secure customers. We have network engineers dedicated to everything from guaranteeing it on the cloud side to checking Wireshark traces for any hint of abnormalities, every day. When I say 3 people, I mean those of us writing the front-end, back-end, database procs, and code that the open-source libraries require, including forking and custom patches. We have other team members ensuring our HA pairs, load balancing, redundancy, fail-overs, and all the other associated technology is working as expected.

I won't get into the details, but we have not violated our SLAs, ever.

And you'd be surprised at what open-source software we are using to drive parts of this system. Kudos to them, they are helping us maintain this with some rock-solid software.


linza 1163 days ago

I think you cannot really salvage this argument. The way you describe it makes it sound to me as though your company's success is more likely due to luck and not just competence. It's also not clear if you really think this is a model that can scale to the size of Meta or if you just wanted to slide in a humblebrag.


908B64B197 1163 days ago

> It's also not clear if you really think this is a model that can scale to the size of Meta or if you just wanted to slide in a humblebrag.

It can't. It depends on "software services" hosted and coded by "other software companies" so his infra's SLA is basically outsourced to either Amazon, Microsoft or Oracle.


EMM_386 1163 days ago

> It depends on "software services" hosted and coded by "other software companies" so his infra's SLA is basically outsourced to either Amazon, Microsoft or Oracle.

It also depends on an incredible amount of in-house custom code.

If I deploy a change that breaks our Angular front-end, or a C# API change that has a typo that routes calls to the wrong places, or a configuration file for our open-source software that handles the phone systems, how exactly do we tell our thousands of customers they can't run their call centers? Or our restaurant customers can't take orders because their phone systems are dead?

Let's not be ridiculous. I'm not humble-bragging. I'm telling you what I do at my job.


deadly_syn 1160 days ago

Aren't most people's at this point?
