Hacker News
by jonas21 569 days ago
This is an interesting writeup, but I feel like it's missing a description of the cluster and the workload that's running on it.

How many nodes are there, how much traffic does it receive, what are the uptime and latency requirements?

And what's the absolute cost savings? Saving 75% of $100K/mo is very different from saving 75% of $100/mo.


In my experience no one bothers unless they are using GPUs or they are already at $100k/mo.

I do think $100k/mo is the tipping point, actually; that is $1.2M/yr.

It costs around $400k/yr in engineering salaries to reasonably support a sophisticated bare metal deployment (though such people can generally do that AND provide a lot of value elsewhere in the business, so really its actual cost is lower than this) and roughly $100k/yr in DC commitments, HW amortisation, and BW. So you save around $700k a year, which is great, but the benefit becomes much greater when your equivalent cloud spend is even bigger than that.
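
To make the arithmetic above concrete, here is a rough sketch in Python. The two cost constants are the estimates quoted in this comment. The function name and the sample spend levels are made up for illustration, and it assumes the bare metal costs stay flat as cloud spend grows, which is a simplification.

```python
# Rough estimates from the comment above, in USD per year.
ENGINEERING_PER_YEAR = 400_000  # salaries to support the bare metal deployment
INFRA_PER_YEAR = 100_000        # DC commitments, HW amortisation, bandwidth


def annual_savings(cloud_spend_per_month: int) -> int:
    """Yearly cloud bill minus the estimated yearly cost of running bare metal.

    Assumes bare metal costs are fixed regardless of scale (a simplification).
    """
    cloud_per_year = cloud_spend_per_month * 12
    return cloud_per_year - (ENGINEERING_PER_YEAR + INFRA_PER_YEAR)


# Hypothetical spend levels, chosen only to show how the savings scale.
for monthly in (50_000, 100_000, 250_000):
    print(f"${monthly:,}/mo cloud spend -> ${annual_savings(monthly):,}/yr saved")
```

At $100k/mo this gives the $700k/yr figure, and the saving grows quickly above that because the bare metal side is treated as a fixed cost.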

If you want to do HA Kubernetes, you need an on-call rotation, and at least 10 engineers to keep that rotation stable.

If you do that in Europe you have to pay them during standby hours.

$400k/year seems very low to me.

You really don't need all 10 people on-call to know k8s to that level. They just need to know enough to recognise when to wake someone else up.

Everywhere I have worked where we have run clusters in the 100s to 1000s of nodes, we have rarely had a team larger than 4-5 true k8s folks. Even then it's been a split between folks who are very focused on hardware provisioning, networking, etc. and higher-level k8s folks who also take on a large portion of the CI/CD work.

At smaller scale (in the $1M/yr ballpark) I have done all the k8s bare metal ops myself along with all CI/CD, and been responsible for a ton of the backend programming too. This is feasible because with distros like Talos etc. it doesn't take a lot of manpower once it's set up, and upgrades aren't too painful at small scale if you aren't running stateful services.

So tbh no, you ideally just need 2 folks at around $200k/yr each who are competent and have done it before. The rest of the folks on the on-call rotation are just the rest of your engineers (and if you are at $1M/yr cloud spend you have more than 10 of those).