

by penglish1 110 days ago

I don't post much on HN, but this topic is near and dear to my heart, so here we go.

Context: been helpdesk, sysadmin & network admin, DevOps, Site Reliability Engineer, in that progression, starting in the '90s. Max on-prem was 40 racks, scaled up and down over the years.

Many comments point out that staffing is a key element of this equation that can't be overlooked, but I decided to reply to the root comment, which doesn't say whether or how it considers staffing.

This is a complex equation - and it is relatively easy to present an incomplete or misleading picture to management in order to push the move into the cloud.. or out of the cloud.

Some factors, in no particular order:

1) Scaling: it is self-evident that pulling a single physical server's worth of workload out of the cloud is not worth it.. even for 288 cores. Or perhaps 1152 for 4xXeon in a single server. Still likely not worth it. Why? Because a single server is never just that. Someone has to swap components when it goes down. And when it goes down.. ALL 1152 cores are down, along with everything they are doing. Is that acceptable for all applications running on all those cores? It also needs appropriate supporting infrastructure - power, cooling, physical space. The "fairly obvious" minimum scale is "enough servers that one can be entirely down for maintenance while everything else keeps running." But now you're paying for some overhead. At 2 servers, you're buying 2x what you need and half that capacity is idle all the time. At 3 servers, it's 1.5x what you need with a third idle. And so on.
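To put rough numbers on that overhead, here is a minimal sketch of the "one spare server" arithmetic. The function name and the N+1 framing are my own shorthand for the rule above, not anything from a vendor or a standard, and it assumes all servers are identical.

```python
def n_plus_one_overhead(servers_needed: int) -> tuple[int, float]:
    """Return (servers_to_buy, idle_fraction) when keeping one spare server.

    Assumption: identical servers, and exactly one may be down for
    maintenance at any time while the rest carry the full load.
    """
    if servers_needed < 1:
        raise ValueError("servers_needed must be at least 1")
    servers_to_buy = servers_needed + 1
    # One server's worth of capacity sits idle whenever nothing is broken.
    idle_fraction = 1 / servers_to_buy
    return servers_to_buy, idle_fraction


if __name__ == "__main__":
    for needed in (1, 2, 4, 9):
        total, idle = n_plus_one_overhead(needed)
        print(f"need {needed:>2} -> buy {total:>2}, {idle:.0%} of capacity idle")
```

The idle share shrinks as the fleet grows, which is one reason tiny on-prem installs tend to look bad on paper.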

On this point - I think the other comments talking about "each SRE managing 4 (or 5, or 7) racks" missed the point entirely. SREs should be doing scalable work, whether in the cloud or on-prem. And they should NOT be swapping failed hard drives and power supplies. Designing a larger-than-one-rack install is probably worth hiring consultants for if you don't have that expertise in-house, though the SREs who would be supporting it would need to supply lots of input. To some extent, server & network equipment vendors can also help. It is not trivial as the scale goes up. But then it should run for some years, with relatively unskilled people handling hardware failures, and you can re-engage consultants if necessary to do upgrades as hardware and needs evolve.

But your SREs should be on-staff, and probably on-call to handle the software running on that hardware.. and to some extent to call the remote hands to deal with hardware failures.

2) Business needs: does the business need the tech skills that self-hosting requires for the core business? For example - if the business itself is cloud SaaS, maybe DIY-ing at least some of your infrastructure is right in your wheelhouse. If so - a modest increase in staff could mean a huge cost savings. But if not, all the cost of skilled staff to run it is simply part of the cost of in-housing this stuff.

3) Staffing: the people that swap broken hardware are not the same people that respond to pages because the business critical application crashed due to a bug. You can pay a colo facility for all this, typically by the hour - but it isn't cheap and you've got to supply all the spares etc. Is that part of your budget for on-prem?

4) On-call: maybe your self-hosted ERP system can be down every night and weekend without issues.. and even during business hours can tolerate 98% uptime. But that doesn't mean you can get away without having someone on call - presumably you're hosting more than just this lowish-requirement ERP system. I'll disagree with other comments - the "no burnout" number of on-call staff you need is 6-7, not 4! Remember people take vacations too. This is well studied and established; see Tom Limoncelli's books. With geo-distributed staff this could be relatively cheap and require fewer people, and it would tend to overlap with the staff you already use to provide on-call for anything you host in the cloud - so maybe for your situation it is close to a wash. But you can't forget to budget for it even if the line item is $0.
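A quick way to see why 4 vs 6-7 matters is to work out how much of each person's working year is spent carrying the pager. This is only a sketch under my own assumptions: week-long shifts, one person on call at a time, and a made-up default of 4 weeks away per person per year. It says nothing about where burnout actually starts.

```python
def oncall_share(team_size: int, weeks_away: float = 4, weeks_per_year: int = 52) -> float:
    """Fraction of a person's working weeks spent on call.

    Assumptions (illustrative only): one person on call per week,
    shifts split evenly, and everyone is away `weeks_away` weeks a year.
    """
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    if not 0 <= weeks_away < weeks_per_year:
        raise ValueError("weeks_away must be in [0, weeks_per_year)")
    shifts_per_person = weeks_per_year / team_size   # the year still needs full coverage
    working_weeks = weeks_per_year - weeks_away      # weeks the person is actually around
    return shifts_per_person / working_weeks


if __name__ == "__main__":
    for size in (4, 6, 7):
        print(f"team of {size}: on call {oncall_share(size):.0%} of working weeks")
```

Vacations don't reduce the number of weeks that need covering, they just squeeze the same shifts into fewer working weeks per person, so small rotations get hit hardest.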

5) Vendor support: maybe you already have your own data center or colo and are hosting a ton of stuff. Why not move all your Atlassian stuff in-house and save the hosting cost? Oh.. whoops, Atlassian simply doesn't support that any more. Host it with Atlassian or GTFO. A minor point, as most vendors would give you enough notice that you can simply run out the lifetime of the hardware it is on and not replace it.

6) Market pricing: At one point Amazon was starting "by the minute, by the core cloud" (as opposed to the older "cloudish" model of leasing only an entire physical server by the entire year) and priced a bit under market to get going. Then once they established dominance, they cranked the price WAY up for profit extraction. But now they do have some competition and they're a bit more selective about how they extract profits. In my perception they've shifted a lot of the profit taking to the value-added services rather than raw instance time, but I could be wrong. And they have HUGE costs - it is beyond naive to look at the per-hour cost of an instance and compare it to purchasing an identical physical server solely on the purchase price of that server. Corey Quinn of The Duckbill Group has spent a huge part of his career in this space - if you're already 100% in the cloud, it is well worth optimizing those costs before you start comparing them to what on-prem might cost.
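To make the "purchase price only" mistake concrete, here is a rough sketch of the comparison with the other line items from this comment included. Every field name is my own invention and every number in the usage example is a placeholder, not a real quote from any vendor, colo, or cloud provider. Plug in your own figures.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # 8760 hours per year / 12


@dataclass
class OnPremServer:
    """Hypothetical monthly cost model for one self-hosted server."""
    purchase_price: float        # up-front hardware cost
    lifetime_months: int         # how long you expect to run it
    colo_monthly: float          # power, cooling, physical space
    remote_hands_monthly: float  # people who swap failed parts
    spares_monthly: float        # spare drives, PSUs, etc.
    staff_monthly: float         # share of SRE / on-call cost

    def naive_monthly_cost(self) -> float:
        # The misleading number: hardware amortization only.
        return self.purchase_price / self.lifetime_months

    def monthly_cost(self) -> float:
        return (
            self.naive_monthly_cost()
            + self.colo_monthly
            + self.remote_hands_monthly
            + self.spares_monthly
            + self.staff_monthly
        )


def cloud_monthly_cost(hourly_rate: float, instances: int = 1) -> float:
    """Monthly cost of always-on instances at a flat hourly rate."""
    return hourly_rate * HOURS_PER_MONTH * instances


if __name__ == "__main__":
    # Placeholder inputs purely for illustration.
    server = OnPremServer(
        purchase_price=36_000, lifetime_months=36, colo_monthly=500,
        remote_hands_monthly=200, spares_monthly=100, staff_monthly=1_500,
    )
    print(f"naive on-prem:   {server.naive_monthly_cost():8.2f} / month")
    print(f"fuller on-prem:  {server.monthly_cost():8.2f} / month")
    print(f"cloud instance:  {cloud_monthly_cost(2.0):8.2f} / month")
```

Even this leaves out the redundancy overhead from point 1 and any cloud-side discounts or optimization, so treat it as a checklist of line items rather than an answer.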