by treffer 869 days ago
I am somewhat puzzled about how this turns into downtime, as the article implies.

The control plane can be fully down (as in: you can shut it down) and everything continues to run. I've been in that situation multiple times with large clusters. E.g. one etcd node had disk issues, which pretty much turned it into a lame service (even worse than truly down). Kubelets got randomly regarded as unhealthy due to update latency. But everything continued to run. Another time the API servers were leaking memory and crashing, which caused a thundering herd that would crash each server as it came up. No issues whatsoever; we just migrated to larger instances.

It's a pretty cool feature of kubernetes.

I am wondering what was done to let this cascade like this. The only thing I could imagine is that someone _wiped_ the etcd state and then brought it back up, causing everything to go down.

1 comment

It goes into downtime because the pods churn their containers, nodes come and go, and attempts to "remediate" in ways that cause deployment churn then take services down without them being able to come back up. The same goes for any internal k8s component that relies on a certificate. It may "stay up" for a bit, but the cluster is still broken, and it gets increasingly more brokener. It's like trying to fix a flat tire on a truck that is dangling over a cliff.