The main issues we faced with over 700 VMs were: outdated OS, full disks, full inodes, broken hardware, missing backups or a missing backup strategy, and OOM.
K8s heals itself: it fixes out-of-memory by restarting a pod, solves storage by shipping logs out and killing a pod if the disk still runs full, and has a rollout strategy, health checks and readiness probes.
It provides an easy deployment mechanism out of the box: adding a domain is easy, and certificates get renewed centrally and automatically.
Scaling is just a replica number, and you have node auto-upgrade features built in.
K8s provides what people build manually, out of the box: certified, open source and battle tested.
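To make that concrete, here is a minimal sketch of a Deployment that covers most of the points above (replica count, rollout strategy, memory and disk limits, health and readiness checks). The app name, image, port and probe paths are all made up for illustration; adjust them to whatever your app actually exposes.

```yaml
# Hypothetical Deployment; name, image, port and probe paths are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3                  # scaling is just this number
  strategy:
    type: RollingUpdate        # replace pods gradually during a rollout
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "128Mi"
            limits:
              memory: "256Mi"            # container is OOM-killed above this, then restarted
              ephemeral-storage: "1Gi"   # pod is evicted if it uses more local disk than this
          livenessProbe:                 # failing this restarts the container
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:                # failing this takes the pod out of service endpoints
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
```

Domains and certificate renewal are not part of this object; those typically live in a separate Ingress plus whatever certificate tooling the cluster runs.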
> The paradigm shift alone, from doing things step by step to describing what you need and then having it happen, is a game changer.
I've actually used both in conjunction and it was decent: Ansible for managing accounts, directories, installed packages (the stuff you might actually need to run containers and/or an orchestrator), essentially taking care of the "infrastructure" part for on-prem nodes, so that the actual workloads can then be launched as containers.
In that mode of work, there was very little imperative about Ansible, for example:
  - name: Ensure we have a group
    ansible.builtin.group:
      name: somegroup
      gid: 2000
      state: present

  - name: Ensure that we have a user that belongs to the group
    ansible.builtin.user:
      name: someuser
      uid: 3000
      shell: /bin/bash
      groups: somegroup
      append: yes
      state: present
This can help you set up some monitoring for the nodes themselves, install updates, mess around with any PKI stuff you need to do and so on: everything that you could achieve either manually or with some Bash scripts running through SSH. Better yet, the people who just want to run the containers won't have to think about any of this, so it ensures separation of concerns as well.
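As a rough sketch of the "install updates" and "monitoring for the nodes" parts in the same declarative style: this assumes Debian/Ubuntu hosts and uses the distro's node exporter package as the monitoring agent, which is my choice for the example rather than anything required.

```yaml
# Assumes Debian/Ubuntu hosts; the package and service names are assumptions.
- name: Ensure all packages are up to date
  ansible.builtin.apt:
    update_cache: yes
    upgrade: dist

- name: Ensure the node exporter is installed
  ansible.builtin.apt:
    name: prometheus-node-exporter
    state: present

- name: Ensure the node exporter is running and enabled at boot
  ansible.builtin.service:
    name: prometheus-node-exporter
    state: started
    enabled: yes
```

Same idea as the user/group tasks: each task states the end state, and re-running the playbook on a node that already matches changes nothing.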
Deploying apps through Ansible directly can work, but most container orchestrators are admittedly better suited for this, if you are okay with containerized workloads. There, they all shine: Docker Swarm, HashiCorp Nomad, Kubernetes (K3s is really great) and so on...
I'm on GKE. The hosts and control plane are managed for me. All I need to do is build/test/security scan images and then promote/deploy the image (via Helm) when it goes out to prod.
Using config management means introducing config drift and managing the underlying operating system, which is a lot more to think about and a lot more that can go wrong.
So you did automation in a broken way. Here's one way to avoid the issues you described on bare metal:
- Only get servers with IPMI so you can remote reboot / power cycle them.
- Have said servers netboot so they always run the newest OS image.
- Make sure said OS image has a config that isn't broken so you don't get full inodes and so it cycles logs.
- Have the OS image include journalbeat to ship logs.
- Have your health checks trigger a recovery script that restarts or moves containers using one of a myriad of tools; monitoring isn't exactly a new discipline.
Yes, it means you have to have a build process for OS images. Yes, it means you need to pick a monitoring system. And yes, it means you need to decide a scheduling policy.
I wrote an orchestrator pre-K8S that was fewer LOC than the YAML config for my home test K8S cluster. Writing a custom orchestrator is often not hard, depending on your workload; writing a generic one is.
K8S provides one opinionated version of what people build manually, and when it's a good fit, it's great. When it isn't, I all too often see people spend more time trying to figure out how to make it work for them than it would've taken them to do it from scratch.
I ran 1000+ VMs on a self-developed orchestration mechanism for many years and it was trivial. This isn't a hard problem to solve, though many of the solutions will end up looking similar to some of the decisions made for K8S. E.g. pre-K8S we ran with an overlay network like K8S, service discovery like K8S, and an ingress based on Nginx like many K8S installs. There's certainly a reason why K8S looks the way it does, but K8S also has to be generic, whereas you can often reasonably make other choices when you know your specific workload.