by noisenotsignal 1164 days ago
4 engineers for an entire phone company sounds scary. I’m sure your engineering is robust enough such that outages are minimal, but that still sounds like a lot of on call (rotation of 4 = once a month?). Even if you only get paged once every few months, you still need to worry about getting paged until your shift is over! Even if you don’t worry in the psychological sense, you still have to schedule around it.

I actually screwed that up: we only have 3 software engineers in total, including myself.

We do have other employees who maintain the hardware, on-call DBAs to manage issues, etc. I'm only speaking to the software engineers.

And we have lots of hardware and lots of open-source solutions that handle the actual calls.

I'm full-stack but lead on the front-end, a complex Angular application to manage everything from huge call centers to small restaurants.

We have C# for APIs, cloud Oracle for the database, and a whole slew of other software and services to manage the actual calls.

Each of us is specialized in specific parts. Our up-time is tremendous given the amount of code we've written. It's extremely stable.

I've been at this 20+ years, as have the other 2. We know enough between us to get this done.

I've never been called after normal work hours. We release the updated front-end every week and haven't had any issues. And a lot of changes/improvements go into that ... that's a lot of my job.

It's well-architected, well-tested, fault-tolerant software.

Sorry, but I'm skeptical that you actually run "an entire phone company" in the way that most people understand the term. You mention that you use lots of open-source solutions, and I'm guessing you outsource the build and operation of the network to a real phone company, probably similar to an MVNO. Am I wrong?

Meta is very different from that. They build the products that users interact with, but they also build things at the bottom of the tech stack. At their scale, it makes business sense to do so, and comparing your headcount with theirs makes no sense.

I don't really understand why software engineers keep dunking on each other like this. I get that people want to broadcast how smart they are, but in reality we're just giving the general public a warped sense of how much work is actually involved in building large-scale software systems.

I’m also skeptical any time someone mentions 5 nines uptime at a small scale. For one, it takes a lot of engineering to be able to actually monitor and detect that with precision (that is, how do you know you are 99.999 and not 99.995?) and with so few people there may be holes in what’s monitored (there are so many places you can drop requests/whatever and lose availability). There are also tail risks like datacenter incidents (if your servers are on three racks in two data centers) or dependencies like power outages that you may be getting lucky on avoiding due to small scale, rather than amortizing over a huge fleet. That is to say, if there is a 1% risk per year that one of your racks goes down and takes you to 3 9s when that happens, you are really at slightly under 4 9s, but with only a few racks it doesn’t happen most years.

That last one is, I suspect, what lets small-scale operators achieve “5 9s” with a fraction of the engineering of larger operators. You can get a lot of 9s most years because you dodge infrequent risks.
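
To put rough numbers on the argument above, here is a minimal sketch of the arithmetic. It assumes a 365-day year and one simplified reading of the example: a normal year runs at the base availability, and with some probability per year a single rare incident adds the downtime of a "3 9s" year on top. The function names and the way the incident is modeled are illustrative assumptions, not anything stated by the commenters.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per year implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def expected_availability(base: float, p_incident: float, incident_level: float) -> float:
    """Long-run average availability when a rare incident adds extra downtime.

    base           -- availability in a year without the rare incident
    p_incident     -- probability per year that the incident happens
    incident_level -- availability the incident alone would produce for that year
    (Assumption: the incident's downtime simply adds to the base downtime.)
    """
    expected_unavailability = (1.0 - base) + p_incident * (1.0 - incident_level)
    return 1.0 - expected_unavailability


if __name__ == "__main__":
    for level in (0.99999, 0.99995, 0.9999, 0.999):
        minutes = downtime_minutes_per_year(level)
        print(f"{level:.5%} -> {minutes:.2f} minutes of downtime per year")

    # Five nines in normal years, plus a 1%-per-year incident worth a "3 9s" year.
    long_run = expected_availability(base=0.99999, p_incident=0.01, incident_level=0.999)
    print(f"long-run average: {long_run:.5%}")
```

The gap between 99.999% and 99.995% is roughly 5 versus 26 minutes of downtime per year, which is why telling them apart takes precise monitoring. Under this simplified reading the long-run average comes out near 99.998%, so a year without the incident looks noticeably better than the long-run figure; a higher incident rate, more racks at risk, or longer outages would pull the average lower.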

> I’m also skeptical any time someone mentions 5 nines uptime at a small scale. For one, it takes a lot of engineering to be able to actually monitor and detect that with precision (that is, how do you know you are 99.999 and not 99.995?)

We have that in place. We run phone systems that businesses depend on, and we have SLAs that guarantee this uptime in order to secure customers. We have network engineers dedicated to everything from guaranteeing it on the cloud side to checking Wireshark traces for any hint of abnormalities, every day. When I say 3 people, I mean those of us writing the front-end, back-end, database procs, and code that the open-source libraries require, including forking and custom patches. We have other team members ensuring our HA pairs, load balancing, redundancy, fail-overs, and all the other associated technology is working as expected.

I won't get into the details, but we have not violated our SLAs, ever.

And you'd be surprised at what open-source software we are using to drive parts of this system. Kudos to them, they are helping us maintain this with some rock-solid software.

I think you cannot really salvage this argument. The way you describe it makes it sound to me that your company's success is more likely due to luck and not just competence. It's also not clear if you really think this is a model that can scale to the size of Meta or if you just wanted to slide in a humblebrag.

> It's also not clear if you really think this is a model that can scale to the size of Meta or if you just wanted to slide in a humblebrag.

It can't. It depends on "software services" hosted and coded by "other software companies" so his infra's SLA is basically outsourced to either Amazon, Microsoft or Oracle.

> I don't really understand why software engineers keep dunking on each other like this.

I'm not dunking on anyone, I'm explaining my day-to-day job with a very small number of senior engineers.

We run a highly complex system. There's no way we could handle millions of calls with five-nines otherwise.

I won't get further into the details because I don't want to reveal too much PII.

I fully understand how much "behind the scenes" work goes on at a place like Facebook. I'm not sitting here imagining rooms of graphic designers thinking what the CSS button radius should be (although I'm sure with 85,000 employees, those happen also).

But note that Musk walked into Twitter, fired en masse, and it still seems the same to me.

Yes, there are senior engineers who keep the core functionality working.

That is still far less than 85,000 employees.

In our case, it's 3 of us handling all software development. And we write a lot of mission critical code.

> I've been at this 20+ years, as have the other 2

> We release the updated front-end every week and haven't had any issues. And a lot of changes/improvements go into it

In 20 years zero issues? I don’t believe you.

> Even if you only get paged once every few months, you still need to worry about getting paged until your shift is over!

I'm not sure what you mean. A page every few months will be considered world-class achievement in a FAANG-like company. Take Amazon, for instance: the on-call is brutal and getting several pages per day is normal. Other companies may be better, but not one-page-every-few-months better.

When you're on-call, you need to be prepared for a page. That means you can't get drunk with your friends, or get on a plane, etc. Being on-call means being ready to answer a page, whether it comes or not.

> A page every few months will be considered world-class achievement in a FAANG-like company.

Except maybe it's a different beast entirely. At Amazon, most teams are constantly pushing new features. A stable phone company may just be handling pages for when hardware fails. Presumably bugs in fast-moving new code show up more often than hardware failures at a tiny org.

Also, fwiw, I've been at Amazon and had on-call rotations where we didn't get paged monthly. If you're getting paged that often and you're not a crazy critical service, your manager/team isn't letting you allocate resources to fixing your alarms or bugs.

I think all those big corps do a lot of change for the sake of change.

That's why we have so much churn in frameworks, etc.

If you're not "innovating" somewhere, you're not good enough to work there.

Traditional companies are more long term and stable with their tooling and decision making.