> In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This is the cause of our outage. Kubernetes node labels.
Wow, so the word police brought down Reddit. Why on earth did someone think it a good idea to screw with existing names in running clusters in a cluster management tool?
I'm sure it was worth making breaking changes to make Kubernetes more "inclusive"... whatever that means. As if Kubernetes ever excluded anybody, or "master" referred to slavery in any way.
These terminology changes date back to the George Floyd protests, and instead of getting action to solve the actual problem, we got blog posts from GitHub about changing the default branch name, so they can feel like they're doing something and elevate their brand.
Iirc this shift was actually much older and started as a pre-gamergate idealogical proxywar. There were even various projects parodying wokeness in tech at the time to comment on it. For example C+= which was a C++ derivative that had more "inclusive" keywords
I want a progressive, inclusive society. I want a society that argues and fights for actual change, not just changing word usage. I bristle at the “master vs. main” debate not because I’m racist, but because it distracts from real impactful change. Every second spent arguing over whether “master” has meaning outside a slavery context is a second NOT spent on expanding educational equity or progressive taxation. The people who are benefiting from the status quo want us fighting over the dictionary.
I wonder if perhaps twitter's new ownership (and decrease in moderation) has impacted their activities? I wonder if perhaps it was an effort led by twitter employees because that sort of thing leads to greater use of twitter?
They've had since December 2020 to update their cluster, and the breaking-ness in 1.24 is called out in a section titled 'Urgent Update Notes'[0], and subtitled 'No, really, you MUST read this before you upgrade'.
So by 'word police' you mean 'admins who didn't bother to read the release notes for the last two years and just deployed straight to production while ignoring the release notes'.
Whatever your politics, breaking changes happen. Not reading the release notes and checking to see if anything affects you is just incompetence.
Imagine if Cisco or Juniper decided to swap master or remove slave from their code. Core routers going down because of word police terminology and an admin who missed it in the change log.
Sounds more like admins need to read over change logs better and properly test updates on dev environments before just blindly updating systems. Features get deprecated, APIs update, crypto algorithms get dropped, it's entirely on the admin to ensure an update will actually work with existing code and systems.
On a very critical system, I wanted to use a newer python module that fixed a very annoying bug in the much older version we were running. Of course the module required a much newer version of python too. I upgraded everything and found that a function in a built-in module I had been using was entirely deprecated, very bad since it was used all over my code. I ended up writing my own module to overload the deprecated function into the new proper way of doing what I needed with only a simple change to my import statements. If I had just properly read over change logs and ran the update on a dev system, I wouldn't have any downtime since I could have made a fix early.
This can be one reason to run the control plane not on k8s itself. When the control plane runs on k8s you can get these weird states where the control plane is borked and the system cannot recover.
Back when we built our own Kubernetes distribution around the Kube 1.6 era I had to fight really hard with our architect to let me run the control plane with systemd instead of within Kube. The extra nodes were considered to be “a waste of resources”.
But in the five or so years we ran that distro the control plane didn’t fail once. Posts like this make me glad I pushed for it.
Technically it already runs kinda “outside of the loop” using static/mirrored pods so it doesn’t go through scheduler assignment/kcm reconciliation loop. If they ran their reflectors that way it probably wouldn’t happen
I appreciate the transparency and detail in publishing this. With that said, the narrative style and wordy,casual language makes it harder to get to the meat (the five whys) than a typical postmortem.
This is a pretty funny "bug". Bring down those Nazi Kubernetes nodes. There's some humor in there somewhere... making a change to be inclusive results in Reddit going offline... mmmm.
I'm still waiting for people to rename "white paper".
Wow, so the word police brought down Reddit. Why on earth did someone think it a good idea to screw with existing names in running clusters in a cluster management tool?