

by datadeft 869 days ago

This is insane:

     The Root CA certificate, etcd certificate, and API server certificate 
     expired, which caused the cluster to stop working and prevented our 
     management of it. The support to resolve this, at that time, in kube-aws 
     was limited. We brought in an expert, but in the end, we had to rebuild the 
     entire cluster from scratch.

I can't even imagine how I could explain such an outage to any of my customers.
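For readers wondering how an expiry like the one quoted above could be caught ahead of time, here is a minimal sketch, not anything from the article. It assumes you already have each certificate's `notAfter` string in the format OpenSSL prints (for example via `openssl x509 -enddate -noout`); the certificate names, the `days_until_expiry` and `expiring_soon` helpers, and the 30-day threshold are all hypothetical.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return whole days until the certificate expires (negative if expired).

    not_after is a string such as "Jan  5 09:34:43 2018 GMT".
    now is a Unix timestamp; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    # ssl.cert_time_to_seconds parses the notAfter format into epoch seconds.
    expires_at = ssl.cert_time_to_seconds(not_after)
    return int((expires_at - now) // SECONDS_PER_DAY)


def expiring_soon(certs, warn_days=30, now=None):
    """Return (name, days_left) pairs for certs within warn_days of expiry.

    certs maps a hypothetical certificate name to its notAfter string.
    The result is sorted so the most urgent certificate comes first.
    """
    flagged = []
    for name, not_after in certs.items():
        days_left = days_until_expiry(not_after, now=now)
        if days_left <= warn_days:
            flagged.append((name, days_left))
    return sorted(flagged, key=lambda pair: pair[1])
```

Run from a scheduled job, a check like this would flag the Root CA, etcd, and API server certificates well before they lapse, assuming someone actually watches the alert.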


treffer 869 days ago

I am somewhat puzzled as to how this turns into downtime, as implied in the article.

The control plane can be fully down - as in: you can shut it down - and everything continues to run. I've been in that situation multiple times with large clusters. E.g. one etcd node having disk issues, pretty much turning it into a lame service (even worse than truly down). Kubelets got randomly regarded as unhealthy due to update latency. But everything continued to run. Another time the API servers were leaking memory and crashing, which caused some herding that would crash each server as it came up. No issues whatsoever; we migrated to larger instances.

It's a pretty cool feature of kubernetes.

I am wondering what was done to let this cascade like this. The only thing I could imagine is that someone _wiped_ the etcd state, then brought it up, causing all things to go down.


0xbadcafebee 868 days ago

It goes into downtime because the pods churn their containers, nodes come and go, and attempts to "remediate" in ways that cause deployment churn then cause services to go down without being able to come back up. Same for any internal k8s component that relies on a certificate. It may "stay up" for a bit, but the cluster is still broken, and it gets increasingly more brokener. It's like trying to fix a flat tire on a truck that is dangling over a cliff.


dilyevsky 869 days ago

Just in the last couple of years I can recall DataDog being down for most of a day, and Roblox had something like a 72h outage. If huge public companies managed, you probably can too. I'd argue that unless real monetary damage was done, it's actually worse for the customer to experience many small-scale outages than a very occasional big outage.


geodel 869 days ago

Well, the industry analysts and consultants who develop metrics have decided that multiple outages are the way to go, as they keep people on their toes more often. And management likes busy people, as they are earning their keep.


badrequest 869 days ago

IIRC Roblox was using Consul


gonzo41 869 days ago

You give them a month of free credits. As is the cloud way.


bdangubic 869 days ago

“us-east-1 was down” :)


datadeft 869 days ago

If most infra I worked on was a single region one, sure. :) DR is so much easier in the cloud. You can have ECS scale to 0 in the DR site and when us-east-1 goes down just move the traffic there. We did that with amazon.com before AWS even existed. With AWS it became easier. There are still some challenges, like having a replica of the main SQL db if you run a traditional stack for example.
