Not knowing the details, I wonder if this incident could be Kubernetes' Waterloo. As far as I understand, the supposed benefits of self-healing and management are the only thing bolstering its use.
It turned out that DO's base images for the worker nodes had automatic updates turned on; these were kicking in at around 6am each day, and causing the nodes to fail.
I'm sure their official incident report will have more details, but right now it looks like this has nothing to do with Kubernetes directly and is down to the underlying OS.
DO have now disabled those automatic updates so that this stops happening; it's been stable since then.