| "The usual story. Everything works, until it doesn't." But the same is true of all the other orchestration tools isn't it? I've had similarly complicated problems with Terraform, Ansible, Chef and Puppet and just plain Linux as I have had with Kubernetes. Meanwhile K8S saves a lot of time when things do work properly - which is nearly always. A while ago, we had an issue with dotnet where the JIT was sometimes emitting bad instructions and crashing the process. That was an absolute bloody nightmare to debug and reproduce, it took weeks - but nobody would say running a high level language is bad because the compiler might have a bug, right? We are a small company (under 20 developers), we have one dedicated ops person and one devops, and have never had any issues with k8s that couldn't be resolved by one of them within a few hours. We run a monorepo with 6 app services as part of our core product, 10 beta/research services, then a handful more infrastructure services (redis etc.), and honestly it's been the smoothest environment I've ever worked with. All the infrastructure is defined in code, can be released onto a blank AWS account (or k3s instance) in minutes, all scales up and down dynamically with load, and most of the time something goes wrong it's a bug in our code. Maybe the problem with your system was more about the excessive use of microservices and general system architecture rather than Kubernetes itself? |
Of course. The difference being how complicated it is to deal with problems.
For example I find it is way easier to deal with problems with Ansible compared to Chef.
So, assuming that both get me what I need, I prefer Ansible because it is less drag for when I have least time available to babysit it (which usually happens at least opportune moment).
What I am trying to say is that, just because it works for you now doesn't mean it will not end in a disaster at some time in the future. It is not my position to tell you if the risk is acceptable for you. But I personally try to avoid situations from which I cannot back easily.
If I have a script that starts an application on a regular VM I KNOW I can fix it whatever may happen to it. Not that I advocate running your services with a script, I just mean there is a spectrum of possible solutions with tradeoffs and it is good to understand those tradeoffs.
Some of those tradeoffs are not easily visible because they may only show themselves in special situations or, opposite, be so spread over time and over your domain that you just don't perceive the little drag you get on everything you do.
I find that if there is any overarching principle to build better solutions it is simplicity.
Presented with two solutions to the problem, the simpler solution is almost always better (the issue being the exact definition what it means to be simpler).
For example, I have joined many teams in the past that had huge problems with their applications. I met teams that were very advanced (they liked overcomplicating their code) and I met teams that could barely develop (they had trouble even writing code, did not know or use more advanced patterns).
I found that it is easier to help teams that had huge problems but wrote stupid code because it is easier to refactor stupid simple code that beginner developers write than it is to try to unpack extremely convoluted structures that "advanced" developers make.
I think similar applies to infrastructure.
For example, when faced with an outage I would usually prefer simpler infrastructure that I know I understand every element of.