by Figs 108 days ago
> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.

> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.

> If there's an issue with the server while they're sick or on vacation, you just stop and wait.

Very much depends on what you're doing, of course, but "you just stop and wait" during sickness or vacation is sometimes good enough uptime, especially if it keeps costs down. I've had that role before. That said, it's usually better to have two or three people who know the systems (even if they're not dedicated to them full time) to raise the bus factor.

So the entire business was happy to go offline for two or three weeks whenever their infra person fancied going off on their summer holiday?

By doing this, you're guaranteeing a bus factor below 1. I can't think of any business that wouldn't see that as a completely unacceptable risk.

I agree.

I never understand the drive to stay away from cloud services for small-scale operations. It’s not your money that’s being spent on the cloud, but it is your free time that goes to being on call when you encourage your company to self-host!

Bus factor 1 is rarely enough for "entire business". But if the GPUs are for training models, and their users are the data scientists that are also on holiday around the same times, that might indeed be a good enough policy.

> and their users are the data scientists that are also on holiday around the same times

I’ve seen this before. It turns into restrictions on when you can schedule vacation times.

Not fun when your family wants to go on a trip but you can’t get the time off because it’s not one of the allowed vacation times.

Ouch, that is indeed a risk one must be wary of. It can be a case of "works for the company but sucks for employees", which can also drain the company of skilled people. That's a poor trade in most cases.