by arein3 1318 days ago
You need a few people to maintain the servers, not a few thousand people.
5 comments

It's not a few people.

Twitter runs their own data centers, which means they own all of the THOUSANDS of machines in them. These machines, and all of their parts, have shelf lives and CONSTANTLY need replacement.

When they are replaced you can't just go to Best Buy with a credit card. At scale VERY SMALL changes matter: oh look they changed something in the disk firmware and haha now your databases corrupt data one out of 1M writes.
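To put a rough number on the "one out of 1M writes" point, here is a minimal back-of-the-envelope sketch. The write rate of 10,000 writes per second is a made-up figure for illustration, not Twitter's actual load, and `expected_corruptions` is a hypothetical helper name.

```python
def expected_corruptions(writes_per_second: float,
                         corruption_rate: float,
                         seconds: float = 86_400) -> float:
    """Expected number of corrupted writes over a time window.

    writes_per_second is an assumed, illustrative load;
    corruption_rate is the chance that any single write is corrupted.
    """
    return writes_per_second * seconds * corruption_rate


if __name__ == "__main__":
    # Assumption: 10,000 writes/s, with a 1-in-1,000,000 corruption rate.
    per_day = expected_corruptions(10_000, 1e-6)
    print(f"expected corrupted writes per day: {per_day:.0f}")
```

Under those assumed numbers, a one-in-a-million bug works out to roughly 864 corrupted writes a day, which is why a "very small" firmware change is not small at scale.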

New machines need to be tested, burned in, installed. Old ones need to be cycled out.

Same goes for power equipment, networking, all that.

Because you built your own data center, and you were an early scale company, it ALSO means a huge percentage of your systems are home grown - asset management, deployment, health checking, metrics, you name it. There are no articles on Stack Overflow. There is no blog post. How that shit works is mostly a function of what people knew about it and, well, at least half of those people are now poof - gone.

This hasn't even gotten to the services themselves, many of which are now running without an owner or any person at the company who has ever looked at them before. The remaining people are now up to their eyeballs in drama, survivor syndrome, fear, and, oh yeah, the work of many of their laid-off peers.

Few people, pfft, give me a break.

Even if you don't think you need that much operations labor at scale (and I'm assuming you are drastically under-estimating Twitter's current scale), when you do a 50% layoff before you even know your exact bus factor and are assuming a 100x/1000x redundancy (somehow), what are the odds you lay off one of those "few" people that are critical to operations? How do you know you haven't thrown out the needles in that big of a haystack?
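The "what are the odds" question can be roughed out with a little arithmetic. This is only a sketch under loud assumptions: layoffs are treated as uniformly random with respect to who is critical (i.e. you really don't know your bus factor), and the headcount of 7,500 and the 5 critical people are hypothetical numbers, not figures from Twitter.

```python
def prob_at_least_one_critical_laid_off(total: int, critical: int, laid_off: int) -> float:
    """Chance that a random layoff hits at least one critical person.

    Assumes the laid-off group is drawn uniformly at random from
    `total` employees, `critical` of whom are the people you can't lose.
    """
    p_all_survive = 1.0
    for i in range(critical):
        # Probability that the i-th critical person is also retained,
        # given the previous i critical people were all retained.
        p_all_survive *= (total - laid_off - i) / (total - i)
    return 1.0 - p_all_survive


if __name__ == "__main__":
    # Hypothetical numbers: 7,500 staff, 5 critical people, 50% layoff.
    p = prob_at_least_one_critical_laid_off(7_500, 5, 3_750)
    print(f"chance of losing at least one critical person: {p:.1%}")
```

With those assumed inputs the chance of losing at least one of the five comes out at about 97%, since each of them is essentially a coin flip. Layoffs that are informed by who is critical would do better; the point is that you need to know who those people are first.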
Software engineers: we’re engineers

Also software engineers: be sure not to fire Ned or else the whole bridge might collapse

At the scale of a company like Twitter, the product and infrastructure are less like a static bridge and more like a complex living, evolving organism. So the analogy is not a very good one.

So your patient might be ok if you fire Ned, but if you try to make changes and a critical system goes down, it might take you a lot longer to fix things without the specialists in that system.

You could keep one specialist around for each system, but then you have a very small bus factor.
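For what it's worth, bus factor per system is easy to compute if you have any record of who knows what. A minimal sketch, assuming a hypothetical mapping from system name to the set of people who understand it (the systems and names below are invented for illustration):

```python
def bus_factors(owners: dict[str, set[str]], laid_off: set[str]) -> dict[str, int]:
    """Number of people still around who understand each system."""
    return {system: len(people - laid_off) for system, people in owners.items()}


def orphaned(owners: dict[str, set[str]], laid_off: set[str]) -> list[str]:
    """Systems left with nobody who understands them, sorted by name."""
    remaining = bus_factors(owners, laid_off)
    return sorted(system for system, count in remaining.items() if count == 0)


if __name__ == "__main__":
    # Hypothetical ownership data, purely for illustration.
    owners = {
        "asset-management": {"ned", "joe"},
        "deploy-tooling": {"ned"},
        "metrics": {"ann", "joe", "raj"},
    }
    laid_off = {"ned", "raj"}
    print(bus_factors(owners, laid_off))
    print(orphaned(owners, laid_off))
```

One specialist per system means every entry in that first dict is 1, so any single departure, vacation, or illness orphans a system.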

Bridges, too, have ongoing inspection and maintenance work, and skipping it for no good reason will lead to collapses.
It’s not one bridge, it’s thousands of bridges, and it’s just not known how critical each bridge is. Or how critical it becomes when another one is down.

Look at all the other major engineering failures in history: it’s always small things (a gasket) on a bad day (too cold?) that somehow work day after day until one day they magically don’t, and you get the Challenger disaster. Everything goes catastrophic over tiny things. Imagine if NASA had fired half their team before that incident. The only guy who knew the gasket can’t get cold might not be there anymore, because Ned got laid off.

Management: Ned didn't print enough code from the last 60 days. There weren't enough pages of paper. Firing Ned.

Software engineers: We did document he was a load-bearing Ned. He was Ops, of course he doesn't code regularly.

To be fair, this is how a lot of industrial engineering works too. Ol’ Joe retired and now we don’t know what that special modification is that we need to make to smooth the flight of planes, ’cause Joe just knew. This is a lot of military and airplane production.
Yes, but if you happen to fire "the wrong" people, it can take a while until the remaining ones understand that component well enough for a hot fix that would have been easy for the original team. And if ops are tied to specific dev teams, one can assume there are now a few components in the mix getting little attention.
So then you have to keep 6k engineers on payroll?
Depending on what you want to do. But it's unlikely you can fire 50% after a weekend in, even if you are willing to let quite a lot of projects die.
Maintaining servers is not the same as maintaining services. Twitter likely had thousands of services doing thousands of different things. At their scale, yes, you need to keep thousands of people on payroll, at least to turn off all the “fluff”.

Even if you can refactor and simplify their work to half the workload, you can’t do that within a week. Even the boring organizational stuff is crazy at this scale. They for sure slashed whole teams at once. Who turned off those services? Or if they’re meant to be running, who owns them, organizationally? Where is the code living: what repo, what part of the code base? When something goes wrong, what metrics are being watched? Overnight, teams had to become responsible for twice the code/services, potentially stuff they have never seen before. Bloated or not, that’s not easy.

If Twitter never changed the product again, you might be right that they could keep the ship afloat with a skeleton crew. But it doesn’t sound like that’s Musk’s plan. He wants pretty substantial feature changes, which normally means less service stability.