Reddit Releases Post Mortem for Its 3 Hour Outage Last Week | HN Mirror

	Reddit Releases Post Mortem for Its 3 Hour Outage Last Week (old.reddit.com)
	109 points by Rebles 1228 days ago

6 comments

mwint 1228 days ago

> In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This is the cause of our outage. Kubernetes node labels.

Wow, so the word police brought down Reddit. Why on earth did someone think it a good idea to screw with existing names in running clusters in a cluster management tool?

c7DJTLrn 1228 days ago

I'm sure it was worth making breaking changes to make Kubernetes more "inclusive"... whatever that means. As if Kubernetes ever excluded anybody, or "master" referred to slavery in any way.

tapoxi 1228 days ago

These terminology changes date back to the George Floyd protests, and instead of getting action to solve the actual problem, we got blog posts from GitHub about changing the default branch name, so they can feel like they're doing something and elevate their brand.

deadly_syn 1224 days ago

Iirc this shift was actually much older and started as a pre-gamergate ideological proxy war. There were even various projects parodying wokeness in tech at the time to comment on it. For example, C+=, a C++ derivative that had more "inclusive" keywords:

https://github.com/TheFeministSoftwareFoundation/C-plus-Equa...

testbjjl 1227 days ago

It’s good to see how increasingly fewer people feel.

londons_explore 1228 days ago

The word police seem to have gone kinda quiet lately.

I wonder if they realised that banning a few words wasn't really helping their cause.

Nextgrid 1228 days ago

Those bullshit positions are the first to go in any kind of recession or downturn, that's the most likely explanation.

danudey 1227 days ago

Or they don't see the point in arguing with people who are obviously against any sort of progressive or inclusive adjustments to society.

PraetorianGourd 1227 days ago

I want a progressive, inclusive society. I want a society that argues and fights for actual change, not just changing word usage. I bristle at the “master vs. main” debate not because I’m racist, but because it distracts from real impactful change. Every second spent arguing over whether “master” has meaning outside a slavery context is a second NOT spent on expanding educational equity or progressive taxation. The people who are benefiting from the status quo want us fighting over the dictionary.

londons_explore 1227 days ago

Their main home seemed to be twitter.

I wonder if perhaps twitter's new ownership (and decrease in moderation) has impacted their activities? I wonder if perhaps it was an effort led by twitter employees because that sort of thing leads to greater use of twitter?

testbjjl 1227 days ago

Was that BLM/ANTIFA HQ? The building will lose value because of all the woke air inside.

danudey 1227 days ago

They've had since December 2020 to update their cluster, and the breaking-ness in 1.24 is called out in a section titled 'Urgent Update Notes'[0], and subtitled 'No, really, you MUST read this before you upgrade'.

So by 'word police' you mean 'admins who didn't bother to read the release notes for the last two years and just deployed straight to production while ignoring the release notes'.

Whatever your politics, breaking changes happen. Not reading the release notes and checking to see if anything affects you is just incompetence.

[0] https://github.com/kubernetes/kubernetes/blob/master/CHANGEL...
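For what it's worth, "checking to see if anything affects you" can be as small as a script. Below is a rough sketch, not anything from the post-mortem: it assumes the legacy label key is `node-role.kubernetes.io/master` and its replacement is `node-role.kubernetes.io/control-plane`, and that you have already dumped your node selectors into a plain dict. The function names and the `selectors` shape are hypothetical.

```python
# Sketch of a pre-upgrade check for the "master" -> "control-plane" label change.
# Assumption: selectors have been exported as {name: {label_key: value}}.
LEGACY_LABEL = "node-role.kubernetes.io/master"
CURRENT_LABEL = "node-role.kubernetes.io/control-plane"


def find_legacy_selectors(selectors):
    """Return the names of selectors that still reference the legacy label."""
    return sorted(
        name for name, labels in selectors.items() if LEGACY_LABEL in labels
    )


def is_control_plane(node_labels):
    """Treat a node as control plane if it carries either label key."""
    return LEGACY_LABEL in node_labels or CURRENT_LABEL in node_labels


if __name__ == "__main__":
    # Hypothetical selectors; only the first one would break after the upgrade.
    selectors = {
        "route-reflectors": {LEGACY_LABEL: ""},
        "workers": {"pool": "general"},
    }
    for name in find_legacy_selectors(selectors):
        print(f"selector {name!r} still depends on {LEGACY_LABEL}")
```

Anything this prints is a selector that would stop matching once the old label disappears from running nodes.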

hoseja 1228 days ago

I laughed out loud when I got to that point. Well deserved.

cf141q5325 1226 days ago

I started laughing, but I'm not as far from crying as I would like to be.

lordloki 1227 days ago

Ironic.

testbjjl 1227 days ago

3 hours, IPO not impacted. People who don’t value inclusion identified. Win-win. Hope you’re enjoying Tucker Carlson (alone, likely) tonight.

post_break 1227 days ago

Imagine if Cisco or Juniper decided to swap master or remove slave from their code. Core routers going down because of word police terminology and an admin who missed it in the change log.

wildzzz 1227 days ago

Sounds more like admins need to read over change logs better and properly test updates on dev environments before just blindly updating systems. Features get deprecated, APIs update, crypto algorithms get dropped, it's entirely on the admin to ensure an update will actually work with existing code and systems.

On a very critical system, I wanted to use a newer Python module that fixed a very annoying bug in the much older version we were running. Of course the module required a much newer version of Python too. I upgraded everything and found that a function in a built-in module I had been using was entirely deprecated, which was very bad since it was used all over my code. I ended up writing my own module to map the deprecated function onto the new proper way of doing what I needed, with only a simple change to my import statements. If I had just properly read over the change logs and run the update on a dev system, I wouldn't have had any downtime, since I could have made the fix early.
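To illustrate the kind of shim described above: the comment doesn't say which function was involved, so this sketch uses `time.clock()` (removed in Python 3.8) purely as a stand-in. The module name `compat_time` is hypothetical, and `time.perf_counter()` is only an approximate replacement, since it measures wall-clock time.

```python
# compat_time.py (hypothetical module name)
# Sketch of a compatibility shim: expose the old function name and forward
# to the modern replacement when the original is gone.
import time


def clock():
    """Stand-in for time.clock(), which was removed in Python 3.8.

    Uses the original if the interpreter still has it; otherwise falls back
    to time.perf_counter(). The two are not strictly equivalent, so this is
    only suitable for measuring elapsed intervals.
    """
    legacy = getattr(time, "clock", None)
    if legacy is not None:
        return legacy()
    return time.perf_counter()


if __name__ == "__main__":
    start = clock()
    sum(range(100_000))  # some work to time
    print(f"elapsed: {clock() - start:.6f}s")
```

Call sites then only need `from compat_time import clock` instead of `from time import clock`, which is the "simple change to my import statements" part.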

__turbobrew__ 1228 days ago

This can be one reason to run the control plane not on k8s itself. When the control plane runs on k8s you can get these weird states where the control plane is borked and the system cannot recover.

cpressland 1228 days ago

Back when we built our own Kubernetes distribution around the Kube 1.6 era I had to fight really hard with our architect to let me run the control plane with systemd instead of within Kube. The extra nodes were considered to be “a waste of resources”.

But in the five or so years we ran that distro the control plane didn’t fail once. Posts like this make me glad I pushed for it.

dilyevsky 1228 days ago

Technically it already runs kinda “outside of the loop” using static/mirrored pods, so it doesn’t go through the scheduler assignment/kcm reconciliation loop. If they ran their reflectors that way, it probably wouldn’t have happened.

dehrmann 1228 days ago

I always find this sort of dogfooding to be academically clever, but operationally risky.

gundmc 1228 days ago

I appreciate the transparency and detail in publishing this. With that said, the narrative style and wordy, casual language make it harder to get to the meat (the five whys) than a typical postmortem.

mynameisvlad 1228 days ago

The intended audience is probably a mix of engineers and regular Reddit users, hence the more casual tone.

ethicalsmacker 1227 days ago

This is a pretty funny "bug". Bring down those Nazi Kubernetes nodes. There's some humor in there somewhere... making a change to be inclusive results in Reddit going offline... mmmm.

I'm still waiting for people to rename "white paper".

hoseja 1228 days ago

314 minutes is not three hours.

suprjami 1228 days ago

It's 3.14 metric hours, close enough.

mlry 1226 days ago

I know here on HN I shouldn't, but I quite enjoyed this comment ...

Rebles 1227 days ago

Sorry. I realized it a minute after I posted :(