| $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that. Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever. If there's an issue with the server while they're sick or on vacation, you just stop and wait. If they take a new job, you need to find someone to take over or very quickly hire a replacement. There's a second bus factor: What happens when that 8xH100 starts to get flaky? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime. Going on-prem like this is highly risky. It works well until the hardware starts developing problems or the person in charge gets a new job. The weeks and months lost to dealing with the server start to become a problem. The SRE team starts to get tired of having to do all of their work on weekends because they can't block active use during the week. Teams start complaining that they need to use the cloud to keep their project moving forward. |
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
> If there's an issue with the server while they're sick or on vacation, you just stop and wait.
It very much depends on what you're doing, of course, but "you just stop and wait" during sickness or vacation is sometimes good enough uptime -- especially if it keeps costs down. I've had that role before... That said, it's usually better to have two or three people who know the systems (even if they're not dedicated to them full time) to reduce the bus-factor risk.