by freedomben 2124 days ago
This is how OpenShift 4 does things. I too thought it was strange at first but now with some experience it's quite pleasant.

Can be a beast to debug though if you haven't done it before.

Aside from being faster than replacing all of the hosts, the reason OpenShift does it this way is that you can't just burn down and replace a fleet of bare metal machines. While re-PXEing is possible, this takes a ton of time and stresses that infrastructure.

Doing the same on cloud, metal, OpenStack, VMware, etc. means that your cluster's operational experience remains the same, and in most cases it is less disruptive.

edit: having your nodes controlled by your cluster has a number of other benefits aside from patching, like the Node Tuning Operator that can tweak settings based on the types of workloads running on that set of machines.

I can assure you that OpenShift doesn't take this path because it is "better". It does so because bare-metal is a significant part of their market and there isn't a better option to automate the process currently.

I once worked on a competing product (before the OS update operator was available) and the update-in-place model was always a disaster. Various problems like DNS, service discovery, timeouts, breaking changes to dependency packages, etc. make for a problematic process. Combine that with the frantic pace of k8s development, the short node compatibility window (2-3 minor k8s releases), and various CVEs, and you end up debugging a lot of machines in unknown states that fail to rejoin clusters after reboots.

This has definitely not been my experience running many hundreds of Red Hat CoreOS nodes in production.

So far, aside from a few small flaky issues, having the cluster nodes _and_ the OpenShift cluster update in lockstep has been dramatically simpler to manage.

Was that with a 1990s mutable package-based distro by any chance?