| HN Mirror



	by arein3 1365 days ago
	You need a few people to maintain the servers, not a few thousand people.

5 comments

shawnb576 1365 days ago

It's not a few people.

Twitter runs their own data centers, which means they own all of the THOUSANDS of machines in them. These machines, and all of their parts, have shelf lives and CONSTANTLY need replacement.

When they are replaced you can't just go to Best Buy with a credit card. At scale, VERY SMALL changes matter: oh look, they changed something in the disk firmware and haha, now your databases corrupt data on one out of every 1M writes.
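To put rough numbers on the firmware example, here is a back-of-the-envelope sketch. Only the one-in-a-million rate comes from the comment; the write rate is a made-up figure for illustration, not anything Twitter has published.

```python
def expected_corrupt_writes(writes_per_second: float,
                            corruption_rate: float,
                            seconds: float) -> float:
    """Expected number of corrupted writes over a time window."""
    return writes_per_second * corruption_rate * seconds


# Hypothetical fleet-wide write rate (an assumption for illustration).
WRITES_PER_SECOND = 100_000
# The "one out of 1M writes" rate from the comment above.
CORRUPTION_RATE = 1 / 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60

per_day = expected_corrupt_writes(WRITES_PER_SECOND, CORRUPTION_RATE, SECONDS_PER_DAY)
print(f"{per_day:.0f} corrupted writes per day")
```

Under those assumed numbers a rate that sounds negligible still adds up to thousands of bad writes a day, which is why someone has to test new hardware before it goes into the fleet.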

New machines need to be tested, burned in, installed. Old ones need to be cycled out.

Same goes for power equipment, networking, all that.

Because you built your own data center, and you were an early scale company, it ALSO means a huge percentage of your systems are home grown: asset management, deployment, health checking, metrics, you name it. There are no articles on Stack Overflow. There is no blog post. How that shit works is mostly a function of what people knew about it and, well, at least half of those people are now, poof, gone.

This hasn't even gotten to the services themselves, many of which are now running without an owner or any person at the company who has ever looked at them before. The remaining people are now up to their eyeballs in drama, survivor syndrome, fear, and, oh yeah, the work of many of their laid-off peers.

Few people, pfft, give me a break.


WorldMaker 1365 days ago

Even if you don't think you need that much operations labor at scale (and I'm assuming you are drastically under-estimating Twitter's current scale), when you do a 50% layoff before you even know your exact bus factor and are assuming a 100x/1000x redundancy (somehow), what are the odds you lay off one of those "few" people that are critical to operations? How do you know you haven't thrown out the needles in that big of a haystack?
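The "what are the odds" question can be sketched with a hypergeometric calculation. This assumes the layoff picks people uniformly at random with no knowledge of who is critical, and the headcount and number of critical people are hypothetical, chosen only to show the shape of the answer.

```python
from math import comb


def prob_critical_person_cut(headcount: int, critical: int, laid_off: int) -> float:
    """Probability that at least one critical person is in a random layoff.

    Assumes every subset of `laid_off` people is equally likely.
    """
    if not (0 <= critical <= headcount and 0 <= laid_off <= headcount):
        raise ValueError("critical and laid_off must be between 0 and headcount")
    # Ways to choose the layoff entirely from non-critical staff,
    # divided by all possible ways to choose the layoff.
    p_all_survive = comb(headcount - critical, laid_off) / comb(headcount, laid_off)
    return 1 - p_all_survive


# Hypothetical numbers: 1000 staff, 5 truly critical people, 50% cut.
p = prob_critical_person_cut(headcount=1000, critical=5, laid_off=500)
print(f"chance of cutting at least one critical person: {p:.1%}")
```

With a 50% cut, each critical person survives with roughly even odds, so the chance that all of them survive shrinks quickly as the number of critical people grows.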


sillysaurusx 1365 days ago

Software engineers: we’re engineers

Also software engineers: be sure not to fire Ned or else the whole bridge might collapse


jljljl 1365 days ago

At the scale of a company like Twitter, the product and infrastructure are less like a static bridge and more like a complex living, evolving organism. So the analogy is not a very good one.

So your patient might be ok if you fire Ned, but if you try to make changes and a critical system goes down, it might take you a lot longer to fix things without the specialists in that system.

You could keep one specialist around for each system, but then you have a very small bus factor.
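A minimal sketch of what checking bus factor before a layoff might look like, assuming you even have a map of who understands which system. The system names, people, and the `bus_factors` helper are all hypothetical.

```python
def bus_factors(maintainers_by_system: dict[str, set[str]],
                laid_off: set[str]) -> dict[str, int]:
    """Count how many people who know each system remain after a layoff."""
    return {
        system: len(people - laid_off)
        for system, people in maintainers_by_system.items()
    }


# Hypothetical ownership map: system -> people who can fix it.
systems = {
    "asset-management": {"ned", "ana"},
    "deploy-pipeline": {"ned"},
    "metrics": {"ana", "raj", "li"},
}

remaining = bus_factors(systems, laid_off={"ned"})
orphaned = sorted(s for s, n in remaining.items() if n == 0)
at_risk = sorted(s for s, n in remaining.items() if n == 1)

print("no one left who knows:", orphaned)
print("bus factor of one:", at_risk)
```

In this toy example, firing Ned leaves one system with nobody and another with a single specialist, which is the situation described above.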


ZeroGravitas 1365 days ago

Bridges also need ongoing inspection and maintenance work, and skipping it for no good reason will lead to collapses too.


vineyardmike 1365 days ago

It’s not one bridge, it’s thousands of bridges; it’s just not known how critical each bridge is. Or how critical it becomes when another one is down.

Look at all the other major engineering failures in history: it’s always small things (a gasket) on a bad day (too cold?) that somehow work day after day until one day they magically don’t, and you get the Challenger disaster. Everything goes catastrophic over tiny things. Imagine if NASA fired half their team before that incident. The only guy who knew the gasket can’t get cold might not still be there because Ned got laid off.


WorldMaker 1365 days ago

Management: Ned didn't print enough code from the last 60 days. There weren't enough pages of paper. Firing Ned.

Software engineers: We did document he was a load-bearing Ned. He was Ops, of course he doesn't code regularly.


SamoyedFurFluff 1365 days ago

To be fair, this is how a lot of industrial engineering works too. Ol’ Joe retired and now we don’t know what that special modification is that we need to make to smooth the flight of planes, ’cause Joe just knew. This is true of a lot of military and airplane production.


johannes1234321 1365 days ago

Yes, but if you happen to fire "the wrong" people, it can take a while until the remaining ones understand that component well enough for a hot fix that would have been easy for the original team. And if ops are tied to the specific dev team, one can assume there are now a few components in the mix getting little attention.


arein3 1365 days ago

So then you have to keep 6k engineers on payroll?)


johannes1234321 1365 days ago

Depending on what you want to do. But it's unlikely you can fire 50% after a weekend in, even if you are willing to let quite a lot of projects die.


vineyardmike 1365 days ago

Maintain servers, not maintain services. Twitter likely had thousands of services doing thousands of different things. At their scale, yes, you need to keep thousands of people on payroll, at least to turn off all the “fluff”.

Even if you can refactor and simplify their work to half the workload, you can’t do that within a week. Even the boring organizational stuff is crazy at this scale. They for sure slashed whole teams at once. Who turned off those services? Or if they’re meant to be running, who owns them, organizationally? Where is the code living, in what repo, in what part of the code base? When something goes wrong, what metrics are being watched? Overnight, teams had to become responsible for twice the code/services, potentially stuff they have never seen before. Bloated or not, that’s not easy.


giaour 1365 days ago

If Twitter never changed the product again, you might be right that they could keep the ship afloat with a skeleton crew. But it doesn’t sound like that’s Musk’s plan. He wants pretty substantial feature changes, which normally means less service stability.
