by datadeft
869 days ago
This is insane: The Root CA certificate, etcd certificate, and API server certificate
expired, which caused the cluster to stop working and prevented our
management of it. The support to resolve this, at that time, in kube-aws
was limited. We brought in an expert, but in the end, we had to rebuild the
entire cluster from scratch.
I can't even imagine how I could explain such an outage to any of my customers.
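For what it's worth, this kind of expiry can be caught ahead of time with a very small check. Below is a minimal sketch, not what kube-aws or the quoted team actually ran: the function names and the 30-day threshold are made up for illustration. It assumes you already have the certificate's `notAfter` string in the format that `openssl x509 -enddate` prints and that Python's `ssl.SSLSocket.getpeercert()` returns.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return the days left before a certificate expires (negative if expired).

    not_after: the certificate's notAfter string, e.g. "Jan  5 09:34:43 2018 GMT".
    now: reference time in seconds since the epoch; defaults to the current time.
    """
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after, threshold_days=30, now=None):
    """Hypothetical policy: flag anything that expires within threshold_days."""
    return days_until_expiry(not_after, now) < threshold_days
```

Running something like this on a schedule against the root CA, etcd, and API server certificates, and alerting when `needs_renewal` is true, would have given weeks of warning instead of a dead cluster.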
|
The control plane can be fully down (as in: you can shut it down) and everything continues to run. I've been in that situation multiple times with large clusters. E.g. one etcd node had disk issues, which pretty much turned it into a lame service (even worse than truly down). Kubelets were randomly regarded as unhealthy due to update latency. But everything continued to run. Another time the API servers were leaking memory and crashing as a result, causing a herd effect that crashed each server as it came up. No issues whatsoever; the fix was to migrate to larger instances.
It's a pretty cool feature of Kubernetes.
I am wondering what was done to let this cascade like this. The only thing I can imagine is that someone _wiped_ the etcd state and then brought it back up, causing everything to go down.