by dgrin91 1020 days ago
I know very little on the datacenter operations side of things - I guess 3 people is not a lot, but what is normal? How many operations people are at say AWS US-East-1? I presume it doesn't scale with number of servers, that would not scale well. What is a 'normal' level? 10? 100? It can't be more than 100, can it?
7 comments

I doubt any amount of staffing could address a lack of specialism in dealing with power or air conditioning issues; both likely involve infrequent maintenance by external vendors. 20 people blowing up a vendor's phone doesn't fix a problem any faster than 1 does.
Amazon goes to the extreme of putting its own custom firmware on switchgear because the choices vendors make in theirs don't align with its objectives.

I don’t think AWS is blowing up a vendor's phone when something goes wrong in one of their facilities.

[1] https://www.datacenterknowledge.com/archives/2017/04/07/how-...

Amazon doesn't make their own AC units or generators, so it's still likely they would need external support for a case like this.
That's some magical thinking, thinking that you don't need hardware people because you wrote some software.

All the same gear is still there; its control functions are just ceded to Amazon panels, and integrated to the point of even removing some PLC-like devices.

> I know very little on the datacenter operations side of things - I guess 3 people is not a lot, but what is normal?

Bear in mind that outside of the US and maybe one or two other locations ex-US, almost all of the magic cloud operates out of third-party datacentres, not their own.

They will have a small office on-site where 3–5 people sit, and those people are exclusively dedicated to the cloud equipment itself. The datacentre ops side is, by definition, handled by the third-party datacentre operator.

The guys onsite are clearly only there for "intelligent hands" purposes, as everything else will be done remotely from Silicon Valley or wherever.

us-east-1 is many, many datacenters across its Availability Zones and tens of billions of investment for Amazon Web Services.

Across all the datacenters the number of operations personnel likely exceeds 100. Think of the unit of scale as a datacenter, with an availability zone potentially containing 10+ of those.

[1] https://www.datacenterfrontier.com/cloud/article/11427911/aw...

0-2 staff in a typical DC is not unusual at all, with on-call people usually within a 30-minute drive.

Larger DCs can and do have more staff on-site 24/7, and typically the number of staff on-site at any given time is driven by SLAs.

I expect the DC in TFA to return to lower staff levels once they've worked on reducing their total "time to restart chiller" or reduced the amount of manual work involved in doing so.

Still we read how DCs are a job generator. E.g. "One hopes for hundreds of jobs for locals." https://cryptoquorum.com/oman-opens-cryptocurrency-mining-ce...
These usually mean temporary construction jobs. Politicians don’t like to point that out.
Yeah, that's counting construction jobs like the sibling said, or just unfounded optimism.

A datacenter with tight SLAs probably needs one 24/7 tech and one 24/7 security guard. That's achievable with 5 jobs for each, so 10 jobs total. Maybe a couple more techs if the ticket volume is high. Never going to be hundreds of permanent jobs.
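
For anyone wondering where "5 jobs for each" comes from, here is a rough sketch of the arithmetic. The 40-hour work week and the `availability` factor (a stand-in for leave and sick days) are my assumptions, not numbers from any real staffing plan, and `jobs_for_post` is just a made-up helper name.

```python
import math

# One post staffed around the clock needs 24 * 7 = 168 hours of cover per week.
HOURS_PER_WEEK = 24 * 7


def jobs_for_post(hours_per_person=40.0, availability=1.0):
    """Jobs needed to keep one post filled 24/7.

    hours_per_person: assumed weekly hours per employee (40 is an assumption).
    availability: assumed fraction of those hours actually worked after
                  leave, sickness, training and so on.
    """
    effective_hours = hours_per_person * availability
    # Round up: you cannot hire a fraction of a person.
    return math.ceil(HOURS_PER_WEEK / effective_hours)


techs = jobs_for_post()    # 168 / 40 = 4.2, rounded up to 5
guards = jobs_for_post()   # same arithmetic for the security post
print(techs, guards, techs + guards)  # prints: 5 5 10
```

Even if you assume people only cover 90% of their nominal hours, `jobs_for_post(availability=0.9)` still comes out at 5, so the estimate has some slack before you need a sixth hire per post.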

Yeah, Northern Virginia trapped themselves with that thinking. They can't stop building datacenters, lest they lose tens of thousands of jobs in the region. And the datacenters know this, so they can beat concessions out of politicians for it.
I've heard of the big-cos having to use bona fide robots for doing manual tasks in a data center, like replacing broken drives or swapping tapes etc. I think there are still a bunch of manual tasks to be done.

That said, I have no idea. When I worked (many many many many years ago) in a small DC that was perhaps the size of a 2-bed apartment, we had 4 guys scurrying about doing stuff (hands-on-keyboard, routing cables, replacing hardware etc). This was way before Docker & Kubernetes et al - physical iron and all that. I would assume that in modern DC ops you could run a football-field-sized DC with fewer than 10 people due to automation. But that said, if part of the actual infrastructure like power or cooling fails, you need to have the right skill-set in place. If the cooler had failed and couldn't just be turned off and on again, we would have been out of luck in my old DC days and would have needed to call someone in and just hope the servers didn't fry in the meantime. Sounds like a similar deal here.

Smaller operations usually need more manpower.
You don't need many on-site staff. 99% of tasks can be performed remotely. The only ones that can't involve physically moving equipment, which doesn't happen that often. And if you're doing a big build out you can bring in extra staff for the couple of weeks that takes.
Over 10k servers here, a couple dozen locations scattered around the globe. One full time operations person.

They go on site to geographically adjacent DCs and outside that just travel onsite for special projects.