Hacker News
by datadeft 869 days ago
This is insane:

     The Root CA certificate, etcd certificate, and API server certificate 
     expired, which caused the cluster to stop working and prevented our 
     management of it. The support to resolve this, at that time, in kube-aws 
     was limited. We brought in an expert, but in the end, we had to rebuild the 
     entire cluster from scratch.
I can't even imagine how I could explain such an outage to any of my customers.
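
For what it's worth, an expiry like the one quoted above can in principle be caught ahead of time with a periodic check. Here is a minimal sketch, assuming you already have each certificate's notAfter string (for example the value printed by `openssl x509 -enddate -noout`, with the `notAfter=` prefix stripped). The names `days_until_expiry` and `needs_rotation` and the 30-day threshold are made up for illustration and are not part of kube-aws or Kubernetes.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return whole days left before a certificate's notAfter time.

    not_after is the text form used by OpenSSL and Python's ssl module,
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means the
    certificate has already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expires_at - current) // 86400)


def needs_rotation(not_after, warn_days=30, now=None):
    """Hypothetical policy: flag certificates with fewer than warn_days left."""
    return days_until_expiry(not_after, now) < warn_days
```

Running something like this against the root CA, etcd, and API server certificates on a schedule, and alerting when `needs_rotation` returns True, would give some warning before the cluster locks you out. How the certificates are actually rotated depends on the tooling that created the cluster.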
4 comments

I am somewhat puzzled about how this turns into downtime, as implied in the article.

The control plane can be fully down - as in: you can shut it down - and everything continues to run. I've been in that situation multiple times with large clusters. E.g. one etcd node having disk issues, pretty much turning it into a lame service (even worse than true down). Kubelets got randomly regarded as non-healthy due to update latency. But everything continued to run. Another time API servers were leaking memory, and with that crashing, causing some herding that would crash each server as it comes up. No issues whatsoever, migrate to larger instances.

It's a pretty cool feature of Kubernetes.

I am wondering what was done to let this cascade like this. The only thing I could imagine is that someone _wiped_ the etcd state, then brought it up, causing all things to go down.

It goes into downtime because the pods churn their containers, nodes come and go, and attempts to "remediate" in ways that cause deployment churn then cause services to go down without being able to come back up. Same for any internal k8s component that relies on a certificate. It may "stay up" for a bit, but the cluster is still broken, and it gets increasingly more brokener. It's like trying to fix a flat tire on a truck that is dangling over a cliff.
Just in the last couple of years I can recall DataDog being down for most of a day and Roblox having something like a 72h outage. If huge public companies managed, you probably can too. I'd argue that unless real monetary damage was done, it's actually worse for the customer to experience many small-scale outages than a very occasional big outage.
Well, the industry analysts and consultants who develop metrics have decided that multiple outages are the way to go, as they keep people on their toes more often. And management likes busy people, as they are earning their keep.
IIRC Roblox was using Consul
You give them a month of free credits. As is the cloud way.
“us-east-1 was down” :)
If most infra I worked on was single-region, sure. :) DR is so much easier in the cloud. You can have ECS scale to 0 in the DR site and, when us-east-1 goes down, just move the traffic there. We did that with amazon.com before AWS even existed. With AWS it became easier. There are still some challenges, like having a replica of the main SQL db if you run a traditional stack, for example.