Hacker News
by throwaway1946 1164 days ago
Sorry, but I'm skeptical that you actually run "an entire phone company" in the way that most people understand the term. You mention that you use lots of open-source solutions, and I'm guessing you outsource the build and operation of the network to a real phone company, probably similar to an MVNO. Am I wrong?

Meta is very different from that: they build the products that users interact with, but they also build things at the bottom of the tech stack. At their scale, doing so makes business sense, and comparing your headcount with theirs makes no sense.

I don't really understand why software engineers keep dunking on each other like this. I get that people want to broadcast how smart they are, but in reality we're just giving the general public a warped sense of how much work is actually involved in building large-scale software systems.

I’m also skeptical any time someone mentions 5 nines uptime at a small scale. For one, it takes a lot of engineering to be able to actually monitor and detect that with precision (that is, how do you know you are 99.999 and not 99.995?), and with so few people there may be holes in what’s monitored (there are so many places you can drop requests/whatever and lose availability). There are also tail risks like datacenter incidents (if your servers are on three racks in two data centers) or dependencies like power outages that you may be getting lucky on avoiding due to small scale, rather than amortizing over a huge fleet. That is to say, if there is a 1% risk per year that one of your racks goes down and takes you to 3 9s when that happens, your long-run average is really below 5 9s, but with only a few racks it doesn’t happen most years.
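
To put rough numbers on this, here is a minimal sketch in Python. It assumes a 365-day year, a baseline of exactly five nines in ordinary years, and reads "takes you 3 9s" as a year that ends at 99.9%. The function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def expected_availability(baseline: float, p_incident: float,
                          incident_year: float) -> float:
    """Long-run average when a fraction p_incident of years end at incident_year.

    Assumption: every other year ends exactly at the baseline figure.
    """
    return (1.0 - p_incident) * baseline + p_incident * incident_year


if __name__ == "__main__":
    for a in (0.99999, 0.99995):
        print(f"{a:.3%} allows {downtime_budget_minutes(a):.1f} min/year")
    avg = expected_availability(0.99999, 0.01, 0.999)
    print(f"long-run average: {avg:.4%}")
```

Telling 99.999 from 99.995 means accounting for about 21 minutes across a whole year, and under these assumptions the long-run average comes out at about 99.998%, roughly double the five-nines downtime budget.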

That last one is, I suspect, what lets small-scale operators achieve “5 9s” with a fraction of the engineering of larger operators. You can get a lot of 9s most years because you dodge infrequent risks.
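
A quick way to see the "dodging" effect, sketched in Python. It reuses the 1% per-rack yearly risk from above and additionally assumes that rack incidents are independent, which real data centers only approximate. The helper names are hypothetical.

```python
def p_quiet_year(racks: int, p_rack_incident: float = 0.01) -> float:
    """Chance that no rack has an incident in a given year.

    Assumption: each rack fails independently with the same yearly probability.
    """
    return (1.0 - p_rack_incident) ** racks


def p_quiet_streak(racks: int, years: int, p_rack_incident: float = 0.01) -> float:
    """Chance of going `years` consecutive years with no rack incident at all."""
    return p_quiet_year(racks, p_rack_incident) ** years


if __name__ == "__main__":
    for racks in (3, 1000):
        print(f"{racks} racks: quiet year {p_quiet_year(racks):.4%}, "
              f"quiet 5 years {p_quiet_streak(racks, 5):.4%}")
```

With three racks a given year passes untouched about 97% of the time, so a clean record says little about the underlying risk. A 1,000-rack fleet almost never gets a quiet year and has to engineer around the same failures.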

> I’m also skeptical any time someone mentions 5 nines uptime at a small scale. For one, it takes a lot of engineering to be able to actually monitor and detect that with precision (that is, how do you know you are 99.999 and not 99.995?)

We have that in place. We run phone systems that businesses depend on, and we have SLAs that guarantee this uptime in order to secure customers. We have network engineers dedicated to everything from guaranteeing it on the cloud side to checking Wireshark traces for any hint of abnormalities, every day. When I say 3 people, I mean those of us writing the front-end, back-end, database procs, and code that the open-source libraries require, including forking and custom patches. We have other team members ensuring our HA pairs, load balancing, redundancy, fail-overs, and all the other associated technology is working as expected.

I won't get into the details, but we have not violated our SLAs, ever.

And you'd be surprised at what open-source software we are using to drive parts of this system. Kudos to them, they are helping us maintain this with some rock-solid software.

I think you cannot really salvage this argument. The way you describe it makes it sound to me as though your company's success is more likely due to luck and not just competence. It's also not clear if you really think this is a model that can scale to the size of Meta or if you just wanted to slide in a humblebrag.

> It's also not clear if you really think this is a model that can scale to the size of Meta or if you just wanted to slide in a humblebrag.

It can't. It depends on "software services" hosted and coded by "other software companies" so his infra's SLA is basically outsourced to either Amazon, Microsoft or Oracle.

> It depends on "software services" hosted and coded by "other software companies" so his infra's SLA is basically outsourced to either Amazon, Microsoft or Oracle.

It also depends on an incredible amount of in-house custom code.

If I deploy a change that breaks our Angular front-end, or a C# API change that has a typo that routes calls to the wrong places, or a configuration file for our open-source software that handles the phone systems, how exactly do we tell our thousands of customers they can't run their call centers? Or our restaurant customers can't take orders because their phone systems are dead?

Let's not be ridiculous. I'm not humble-bragging. I'm telling you what I do at my job.

Aren't most people's at this point?

> I don't really understand why software engineers keep dunking on each other like this.

I'm not dunking on anyone, I'm explaining my day-to-day job with a very small number of senior engineers.

We run a highly complex system. There's no way we could handle millions of calls with five-nines otherwise.

I won't get further into the details because I don't want to reveal too much PII.

I fully understand how much "behind the scenes" work goes on at a place like Facebook. I'm not sitting here imagining rooms of graphic designers debating what the CSS button radius should be (although I'm sure with 85,000 employees, that happens too).

But note that Musk walked into Twitter, fired people en masse, and it still seems the same to me.

Yes, there are senior engineers who keep the core functionality working.

That is still far less than 85,000 employees.

In our case, it's 3 of us handling all software development. And we write a lot of mission critical code.