
by Aurornis 100 days ago

$120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.

Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.

If there's an issue with the server while they're sick or on vacation, you just stop and wait.

If they take a new job, you need to find someone to take over or very quickly hire a replacement.

There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.

Going on-prem like this is highly risky. It works well until the hardware starts developing problems or the person in charge gets a new job. The weeks and months lost to dealing with the server start to become a problem. The SRE team starts to get tired of having to do all of their work on weekends because they can't block active use during the week. Teams start complaining that they need to use cloud to keep their project moving forward.


Figs 100 days ago

> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.

> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.

> If there's an issue with the server while they're sick or on vacation, you just stop and wait.

Very much depends on what you're doing, of course, but "you just stop and wait" for sickness/vacation is sometimes actually good enough uptime, especially if it keeps costs down. I've had that role before... That said, it's usually better to have two or three people who know the systems (even if they're not dedicated to them full time) to raise the bus factor.

roryirvine 100 days ago

So the entire business was happy to go offline for two or three weeks whenever their infra person fancied going off on their summer holiday?

By doing this, you're guaranteeing a bus factor below 1. I can't think of any business that wouldn't see that as a completely unacceptable risk.

Aurornis 99 days ago

I agree.

I never understand the drive to stay away from cloud services for small scale operations. It’s not your money that’s being spent on the cloud, but it is your free time being asked to be on call when you encourage your company to self-host!

jononor 99 days ago

Bus factor 1 is rarely enough for "entire business". But if the GPUs are for training models, and their users are the data scientists that are also on holiday around the same times - that might indeed be good enough policy.

Aurornis 99 days ago

> and their users are the data scientists that are also on holiday around the same times

I’ve seen this before. It turns into restrictions on when you can schedule vacation times.

Not fun when your family wants to go on a trip but you can’t get the time off because it’s not one of the allowed vacation times.

jononor 99 days ago

Ouch, that is indeed a risk one must be wary of. It can be a "works for the company but sucks for employees" situation, which can also drain the company of skilled people: a poor trade in most cases.

justsomehnguy 100 days ago

If a business requires at least a quarter million bucks' worth of hardware for basic operation, yet can't pay the market rate for someone to operate it, maybe the basics of that business are not okay?

stego-tech 99 days ago

This.

Companies following consultant reports will usually end up offering 50% ranges, which for SRE/SIE roles in major metros comes to around $163k. If they study BLS/FRED/CPI data and aim to pay someone enough for a 50/30/20 budget in a major metro at median rent, they’ll offer $175k to $200k+. If they want someone to stick around, buy an average home, lay roots, it’s $210k+, minimum.

“Six figures” doesn’t cover essentials anymore for almost every major city in the USA, and the last thing you can afford to cheap out on is the labor supporting your IT infra. Every corner you cut today on TC (outsourcing, offshoring, consulting) is just letting fires rage until you either parachute out or everything burns down, and that’s not a game you can afford to play with critical business technologies.

Aurornis 99 days ago

I’m not disagreeing. I’m explaining to the commenter above that $120K isn’t going to cover the costs of a full-time SRE who will be on call 24/7.

If a business can’t afford a properly staffed crew with enough allowance to cover a rotation of on call duties and allow for vacations, they should prefer the managed cloud services.

You’re paying more but you’re buying freedom and flexibility.
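To put rough numbers on the rotation point, here is a minimal sketch. It assumes on-call duty is split evenly across the team, which is an illustrative simplification and not something anyone in the thread specified; `on_call_hours_per_week` is a hypothetical helper name.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in a week


def on_call_hours_per_week(team_size: int) -> float:
    """Average weekly on-call hours per person, assuming an even rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return HOURS_PER_WEEK / team_size


if __name__ == "__main__":
    for size in (1, 2, 3, 4):
        hours = on_call_hours_per_week(size)
        print(f"{size} engineer(s): {hours:.0f} on-call hours each per week")
```

With one person the answer is all 168 hours, every week. It takes three people before the load drops to 56 hours each, and that is before accounting for vacations or sick days.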

Manuel_D 100 days ago

> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one.

You can still use cloud for excess capacity when needed. E.g. use on-prem for base load, and spin up cloud instances for peaks in load.
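A minimal sketch of that base-load/burst split, assuming demand is measured in GPU-hours per period and that anything above on-prem capacity can be sent to cloud instances. The function name, the units, and the sample numbers are illustrative assumptions, not figures from the thread.

```python
def split_load(
    demand_gpu_hours: list[float], on_prem_capacity: float
) -> list[tuple[float, float]]:
    """For each period, return (on_prem_hours, cloud_hours).

    On-prem absorbs demand up to its capacity; the remainder bursts to cloud.
    """
    plan = []
    for demand in demand_gpu_hours:
        on_prem = min(demand, on_prem_capacity)
        cloud = max(demand - on_prem_capacity, 0.0)
        plan.append((on_prem, cloud))
    return plan


if __name__ == "__main__":
    # One 8-GPU server running for a day offers 8 * 24 = 192 GPU-hours.
    daily_capacity = 8 * 24.0
    for on_prem, cloud in split_load([100.0, 192.0, 300.0], daily_capacity):
        print(f"on-prem: {on_prem:.0f} GPU-hours, cloud: {cloud:.0f} GPU-hours")
```

The same split also covers the failure case: if the single server is down, its capacity is effectively zero and every period bursts to cloud until it is repaired.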

stego-tech 99 days ago

This is my favorite use of the public cloud: the modern-day “hot site”. It’s way cheaper to just pay reserved rates for failover instances of critical infra than a whole other unused site, assuming your particular compliance or regulatory frameworks allow it. Especially in an era of remote work, it’s highly practical and cost-effective.

PunchyHamster 100 days ago

> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.

They come with a warranty, often with a technician guaranteed to arrive within a few hours or at most a day. Also, if SHTF, using cloud to cover whatever capacity you're missing isn't hard.

formerly_proven 100 days ago

> There's a second bus factor: What happens when that 8xH100 starts to get flakey?

These come in a non-flakey variant?

spwa4 100 days ago

It's called a warranty.

And the other argument: every company I've ever known to do AWS has an AWS sysadmin (sorry, "devops"), same for Azure. Even for small deployments. And departments want their own person/team.

Aurornis 99 days ago

You can tell in this thread who has and who hasn’t had to work with this hardware.

My favorites are the responses from people saying the warranty will have someone show up in “hours” and fix it. Best of luck to you.

stego-tech 100 days ago

Out of all the comments on numbers, SREs, and scaling, you get the response for meeting numbers with numbers!

> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.

Literally this. I can do SRE on-prem and cloud, and my 50/30/20 budget break-even point (as in, needs and savings but no wants - so 70%) is $170k before taxes. Rent is astonishingly high right now, and the sort of mid-career professional you want to handle SRE for your single DC is going to take $150k in this market before fucking off to the first $200k job they get.

Know your market, and pay accordingly. You cannot fuck around with SREs.

> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.

This is less of an issue than you might think, but strongly dependent upon the quality of talent you’ve retained and the budget you’ve given them. Shitbox hardware or cheap-ass talent means you’ll need to double or triple up locally, but a quality candidate with discretion can easily be supported by a counterpart at another office or site, at least short-term. Ideally though, yeah, you’ll need two engineers to manage this stack, but AWS savings on even a modest (~700 VMs) estate will cover their TC inside of six months, generally.
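The payback claim can be turned around to see what it implies per VM. This is a rough sketch: the ~700 VMs and the six-month window come from the comment, while the $200k total comp per engineer is an assumed placeholder and `required_saving_per_vm` is a hypothetical helper.

```python
def required_saving_per_vm(total_comp: float, months: int, vm_count: int) -> float:
    """Monthly saving per VM needed to cover total_comp within `months`."""
    if months <= 0 or vm_count <= 0:
        raise ValueError("months and vm_count must be positive")
    return total_comp / months / vm_count


if __name__ == "__main__":
    engineers = 2
    comp_each = 200_000  # assumed placeholder, not a figure from the comment
    saving = required_saving_per_vm(engineers * comp_each, months=6, vm_count=700)
    print(f"Needs about ${saving:.2f} saved per VM per month")
```

Under those assumptions the estate has to save roughly $95 per VM per month versus its cloud bill for the six-month claim to hold. Whether that is realistic depends entirely on instance sizes and the on-prem costs being compared.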

> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.

This strikes at another workload I neglected to mention, and one I highly recommend keeping in the public cloud: GPUs.

GPUs on-prem suck. Drivers are finicky, firmware is flakey, vendor support is inconsistent, and SR-IOV is a pain in the ass to manage at scale. They suck harder than HBAs, which I didn’t think was possible.

If you’re consuming GPUs 24x7 and can afford to support them on-prem, you’re definitely not here on HN killing time. For everyone else, tune your scaling controls on your cloud provider of choice to use what you need, when you need it, and accept the reality that hyperscalers are better suited for GPU workloads - for now.

> Going on-prem like this is highly risky.

Every transaction is risky, but the risk calculus for “static” (ADDS) or “stable” (ERP, HRIS, dev/test) work makes on-prem uniquely appealing when done right. Segment out your resources (resist the urge for HPC or HCI), build sensible redundancies (on-prem or in the cloud), and lean on workhorse products over newer, fancier platforms (bulletproof hypervisors instead of fragile K8s clusters), and you can make the move successful and sensible. The more cowboy you go with GPUs, K8s, or local Terraform, the more delicate your infra becomes on-prem - and thus the riskier it is to keep there.

Keep it simple, silly.

throwup238 100 days ago

> Out of all the comments on numbers, SREs, and scaling, you get the response for meeting numbers with numbers!

>> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.

> Literally this. I can do SRE on-prem and cloud, and my 50/30/20 budget break-even point (as in, needs and savings but no wants - so 70%) is $170k before taxes. Rent is astonishingly high right now, and the sort of mid-career professional you want to handle SRE for your single DC is going to take $150k in this market before fucking off to the first $200k job they get.

That's $120k per pod. Four pods per rack at 50kW.

What universe are we living in that a single SRE can't manage even a single rack for less than half a million in total comp?
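The arithmetic behind "less than half a million", spelled out. Both inputs are the figures quoted in the comment; treating the $120k as scaling linearly with pod count is an assumption.

```python
COST_PER_POD = 120_000  # dollars, figure quoted in the thread
PODS_PER_RACK = 4       # figure quoted in the thread

# Assumes the per-pod figure simply multiplies across the rack.
cost_per_rack = COST_PER_POD * PODS_PER_RACK
print(f"${cost_per_rack:,} per rack")
```

That comes to $480,000 per rack, which is the budget the comment is weighing against one SRE's total comp.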

stego-tech 99 days ago

> What universe are we living in that a single SRE can't manage even a single rack for less than half a million in total comp?

The kind where TC isn’t measured by pod managed, but by person hired. Also the world where median rent in major metros is $3500 a month.

If you think $120k is rich, you’re either operating in the boonies, outside the USA/Canada, or incredibly out of touch with the cost of living today and need to seriously go study BLS/FRED/CPI data sets to understand how expensive it is to live right now.

ragall 99 days ago

> outside the USA/Canada

Indeed, there's no reason for a company to host this kind of batch compute in North America. You can get very good people in Eastern Europe at 1/3 the cost.

Aurornis 99 days ago

I like how this simple claim about being cheaper to self-host a single server has now escalated to opening an office in Eastern Europe and hiring people there to manage it.

ragall 99 days ago

The trend of opening offices in Europe started one year into Covid. I'm sure that there are companies that haven't opened an office there yet, but fewer than one might imagine.

lazylizard 100 days ago

i am not sre, merely sysadmin.

and somehow i have this impression that gpus on slurm/pbs could not be simpler.

u can use a vm for the head node, don't even need the clustering really.. if u can accept taking 20min to restore a vm.. and the rest of the hardware is homogeneous - you set up 1 right and the rest are identical.

and its a cluster with a job queue.. 1 node going down is not the end of the world..

ok if u have pcie GPUs sometimes u have to re-seat them and its a pain. otherwise if ur h200 or disks fail u just replace them, under warranty or not...
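A toy model of why a job queue tolerates a dead node. This is not Slurm's or PBS's scheduler, just a round-robin sketch with made-up node and job names showing the queued work landing on whatever nodes are still healthy.

```python
from collections import deque


def drain_queue(
    jobs: list[str], nodes: list[str], down: set[str]
) -> dict[str, list[str]]:
    """Hand queued jobs to healthy nodes in round-robin order.

    Nodes listed in `down` receive nothing; the remaining nodes absorb the work.
    """
    healthy = [node for node in nodes if node not in down]
    if not healthy:
        raise RuntimeError("no healthy nodes: the queue just waits")
    queue = deque(jobs)
    placement: dict[str, list[str]] = {node: [] for node in healthy}
    turn = 0
    while queue:
        node = healthy[turn % len(healthy)]
        placement[node].append(queue.popleft())
        turn += 1
    return placement


if __name__ == "__main__":
    result = drain_queue(
        ["j1", "j2", "j3", "j4"], ["gpu01", "gpu02", "gpu03"], down={"gpu02"}
    )
    for node, assigned in result.items():
        print(node, assigned)
```

With homogeneous nodes, losing one only stretches the queue. The case the sketch raises on, no healthy nodes at all, is exactly the single-server situation from the top of the thread.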

stego-tech 99 days ago

That sounds way easier than the methods I’ve had to manage GPUs in the Enterprise on-prem thus far (PCIe cards slotted into hypervisor boxes and shared via SR-IOV). I’ll have to look into it, but I doubt it’ll ever enter my personal wheelhouse given how quickly GPU-based workloads are either moved to the cloud for effective utilization at scale, or onto custom accelerators for edge workloads/inference.

charcircuit 100 days ago

> If there's an issue with the server while they're sick or on vacation, you just stop and wait.

You can ask AI to troubleshoot and fix the issue.