by peterangular 2025 days ago
When I was working in a stack like this I found people spending astounding amounts of time not actually working to improve the stability/performance of the application. The reason you triggered that memory was "GKE, Prometheus, Sysdig, Grafana, ELK" - that's exactly what we were dealing with. The support infrastructure/compute needs for it far exceeded the 20-30 hosts that actually needed to be there to operate the application.

We either were using someone else's prebuilt orchestration for something like ELK (insecure, needs constant auditing to be OK) or rolling it ourselves (very expensive in engineer time). None of it was ever working 100% and that was because we were jumping at software packages no one had really taken the time to fully understand. The mentality was "it's containerized!" which many on my team took to mean "we don't need to really grok it, it's in a container!" That burnt us, both on our TIG and ELK stacks. I left that job because it became putting out dumb fires that were not business-justifiable.

All-in-all I'm not saying what anyone is doing is wrong, I'm just saying that if you're going for an orchestrated environment like this you have to have a very mature team. You have to really care about learning these services well, and you have to be careful to not let your own architecture take your time away from solving real problems for the business.

The team I was on did not have that maturity outside of a couple bitter/broken ops guys who didn't deserve what the team had done to them while buzzword-driven leadership gutted their very proven and stable VMware infra into a total cluster-f K8s setup because "that's what we're supposed to do in 2018! That's what the new engineers want to work in!"

> the "operating system" has moved up a stack

Splitting hairs: The OS is still the same. The "stack" is a newly imposed abstraction on top of already established paradigms, where we are trying to abstract ourselves away from the OS. It's distributed compute more than it is the "OS moving up a stack".

Edit: Ha, I think you may have edited your comment with the Coinbase article. That article is actually what I point people to when explaining that K8s isn't some silver bullet. I personally think Coinbase's approach is a great compromise in leveraging containers without going off the rails (as they write about, e.g. the need for dedicated "compute" teams).

3 comments

"outside of a couple bitter/broken ops guys who didn't deserve what the team had done to them"

Hey, that's me.

Well, if it's any consolation, programmers aren't safe either. In fact, we are highly paid obnoxious people (to execs) and just imagine being able to replace one of us with a box that works 24 hours, 7 days a week for the cost of the hardware, electricity, and network. How exciting! If and when that happens, I suppose I can find work as a (bad) carpenter or something.

I mostly agree with all of what you've said here. In our case, it's not unusual for a single customer environment to surge to 200-300 instances of an underlying compute server, and then scale back down to 20-30 at steady state. With 30 customer environments, you might have customers running anywhere from as few as 15 containers to as many as 500+, with a lot of dynamic flux depending on data ingestion and ETL.

K8s is in flux, so you still have to have a few top-end SRE types to manage your kube environment. The acceleration/maturity of the ecosystem is incredible though, so sometime in the next 3-4 years, we'll start to see things get standardized enough that the wizardry required to keep it running will become a more commodity skill set.

And, more importantly, most of the ecosystem is fairly identical between Azure/Google/AWS - so porting or going multi-cloud is usually a week's effort if that's something you want to do.

By "Moving up the Stack" - Of course I understand that cgroups/Linux underpins it all - it's just that we're not using Linux system binaries to manage the containers directly.

I mean process, storage, memory, CPU, and resource utilization aren't things we tweak/query with OS commands; rather, we send request/limit configurations to kube and let it worry about managing the resources, relying on PromQL to monitor resource utilization, etc.
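To make that concrete, here is a rough sketch of what "sending request/limit configurations and querying with PromQL" can look like. It is only an illustration: the helper names, the `etl` namespace, and the specific CPU/memory values are made up, and the query assumes the cluster exposes the usual cAdvisor metric `container_cpu_usage_seconds_total` with a `namespace` label.

```python
import json


def container_resources(cpu_request, mem_request, cpu_limit, mem_limit):
    """Build the `resources` stanza of a Kubernetes container spec.

    Hypothetical helper: kube schedules on `requests` and enforces `limits`.
    """
    return {
        "requests": {"cpu": cpu_request, "memory": mem_request},
        "limits": {"cpu": cpu_limit, "memory": mem_limit},
    }


def cpu_usage_query(namespace):
    """Return a PromQL string for per-pod CPU usage over the last 5 minutes.

    Assumes cAdvisor metrics are being scraped; the namespace is illustrative.
    """
    return (
        "sum by (pod) (rate(container_cpu_usage_seconds_total"
        f'{{namespace="{namespace}"}}[5m]))'
    )


# Example values only; real numbers depend on the workload.
resources = container_resources("250m", "256Mi", "500m", "512Mi")
print(json.dumps({"resources": resources}, indent=2))
print(cpu_usage_query("etl"))
```

The point is just that the knobs live in a declarative spec and a query language instead of in `top`, `ulimit`, or `nice` on a given host.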

> we'll start to see things get standardized enough that the wizardry required to keep it running will become a more commodity skill set

I am so ready for this!

Constant scale-up/scale-down and dynamic load is what I jump at K8s for personally. Totally see the use-case for what you're talking about.

All-in-all I love K8s and containers, use them myself, and have been really happy with the results. It's just when I've worked with it professionally I don't find my colleagues typically have the skillset (not the fault of the tech).

GKE is the container/kubernetes engine, but then you also mention Prometheus, Sysdig, Grafana, ELK.

Sounds like much of the problem was the monitoring stack, just curious why you blame that on containers and k8s? Wouldn't you still have needed a solution for that for 20-30 hosts regardless of how you're orchestrating/running the applications?

> just curious why you blame that on containers and k8s?

Crawl, walk, run sorta stuff. We had never gotten the application/monitoring/everything humming on pure Linux hosts first; we skipped that entirely because "K8s and containers!" When you haven't properly QA'd, vetted, or whatever your stack, throwing heavy abstraction at it (containers/K8s) is an anti-pattern.

Most companies don't have the resources to run a competent K8s distributed compute infrastructure, and as a hiring manager (as much as an IC) I know I have to hire very specific, very expensive people for that role. Good ops folks come with experience in their realm, and the newer the tech stack, the harder it is to find competent help due to talent market conditions.

I don't blame containers and K8s - I blame the people, and I blame companies/teams for jumping at new tech that often doesn't have a justifiable use-case outside of "we're doing the popular thing!" vs. really considering what the needs of the solution are.

I also have a very low tolerance for downtime, and with those huge abstractions I find stuff gets missed more often, leading to my application being down for my users. I am a KISS engineer.

Fair enough, I definitely can relate to that line of problems.

"Should we spend the time to do a thorough look at our monitoring needs, figure out where the gaps are, be more disciplined about using the tool consistently, etc.?"

"That'll take too long, I heard about this shiny new ops tool that claims to require zero configuration, let's just drop this in instead!"