| So you did automation in a broken way. Here's one way to avoid the issues you described on bare metal: - Only get servers with IPMI so you can remotely reboot / power cycle them. - Have said servers netboot so they always run the newest OS image. - Make sure said OS image has a config that isn't broken, so you don't run out of inodes and so it rotates logs. - Have the OS image include journalbeat to ship logs. - Have your health checks trigger a recovery script that restarts or moves containers using one of a myriad of tools; monitoring isn't exactly a new discipline. Yes, it means you have to have a build process for OS images. Yes, it means you need to pick a monitoring system. And yes, it means you need to decide on a scheduling policy. I wrote an orchestrator pre-K8S that was fewer LOC than the YAML config for my home test K8S cluster. Writing a custom orchestrator is often not hard, depending on your workload; writing a generic one is. K8S provides one opinionated version of what people build manually, and when it's a good fit, it's great. When it isn't, I all too often see people spend more time trying to figure out how to make it work for them than it would've taken them to do it from scratch. |
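
To make the "health checks trigger a recovery script" step concrete, here is a rough sketch of what the planning half of such a script might look like. Everything in it is an assumption for illustration: the `Host` type, the `pick_host` policy (healthy host with the most free slots), and the `recover` function are hypothetical names, not anything from a real tool. It only decides where containers should go; actually restarting them, or power cycling the dead host over IPMI, is left to whatever tooling you already have.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical view of one server as reported by your monitoring system."""
    name: str
    healthy: bool = True
    capacity: int = 4  # assumed: max containers this host may run
    containers: list[str] = field(default_factory=list)


def pick_host(hosts):
    """Assumed scheduling policy: the healthy host with the most free slots."""
    candidates = [
        h for h in hosts
        if h.healthy and len(h.containers) < h.capacity
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.capacity - len(h.containers))


def recover(hosts):
    """Plan moves of containers off unhealthy hosts.

    Returns a list of (container, source_host, target_host) tuples and
    updates the in-memory placement. It does not touch real machines.
    """
    moves = []
    for host in hosts:
        if host.healthy:
            continue
        # Iterate over a copy because we mutate host.containers below.
        for container in list(host.containers):
            target = pick_host(hosts)
            if target is None:
                break  # no spare capacity; retry on the next health check
            host.containers.remove(container)
            target.containers.append(container)
            moves.append((container, host.name, target.name))
    return moves
```

A monitoring hook would call `recover(hosts)` whenever a health check fails and then hand each planned move to the container runtime. Swapping in a different policy only means replacing `pick_host`, which is roughly why a workload-specific orchestrator can stay small.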