by EMM_386 1164 days ago
> I’m also skeptical any time someone mentions 5 nines uptime at a small scale. For one, it takes a lot of engineering to be able to actually monitor and detect that with precision (that is, how do you know you are 99.999 and not 99.995?)

We have that in place. We run phone systems that businesses depend on, and we have SLAs that guarantee this uptime in order to secure customers. We have network engineers dedicated to everything from guaranteeing it on the cloud side to checking Wireshark traces for any hint of abnormalities, every day. When I say 3 people, I mean those of us writing the front-end, back-end, database procs, and code that the open-source libraries require, including forking and custom patches. We have other team members ensuring our HA pairs, load balancing, redundancy, fail-overs, and all the other associated technology is working as expected.
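
For a sense of scale on the 99.999 vs 99.995 question quoted above, here is a rough sketch of the yearly downtime budget each figure allows. It assumes a 365-day year, and the function name is made up for illustration; it says nothing about how any particular SLA actually defines or measures uptime.

```python
# Allowed downtime per year for a given availability percentage.
# Assumption: a 365-day year. The function name is illustrative only.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def downtime_budget_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year permitted at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.999, 99.995):
    minutes = downtime_budget_seconds(pct) / 60
    print(f"{pct}% -> {minutes:.1f} minutes/year")
```

That comes out to roughly 5.3 minutes a year at 99.999 and roughly 26.3 minutes a year at 99.995, so telling the two apart means accounting for outages on the order of minutes across a whole year.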

I won't get into the details, but we have not violated our SLAs, ever.

And you'd be surprised at what open-source software we are using to drive parts of this system. Kudos to them, they are helping us maintain this with some rock-solid software.


I think you cannot really salvage this argument. The way you describe it makes it sound to me that your company's success is more likely due to luck and not just competence. It's also not clear if you really think this is a model that can scale to the size of Meta or if you just wanted to slide in a humblebrag.
> It's also not clear if you really think this is a model that can scale to the size of Meta or if you just wanted to slide in a humblebrag.

It can't. It depends on "software services" hosted and coded by "other software companies" so his infra's SLA is basically outsourced to either Amazon, Microsoft or Oracle.

> It depends on "software services" hosted and coded by "other software companies" so his infra's SLA is basically outsourced to either Amazon, Microsoft or Oracle.

It also depends on an incredible amount of in-house custom code.

If I deploy a change that breaks our Angular front-end, or a C# API change that has a typo that routes calls to the wrong places, or a configuration file for our open-source software that handles the phone systems, how exactly do we tell our thousands of customers they can't run their call centers? Or our restaurant customers can't take orders because their phone systems are dead?

Let's not be ridiculous. I'm not humble-bragging. I'm telling you what I do at my job.

Aren't most people's at this point?