> “Sometimes when we do an etcd failover, the API server starts timing out requests until we restart it.”
This is likely related to a set of Kubernetes bugs [1][2] (and a gRPC bug [3]) that CoreOS is working diligently to get fixed. The first of these fixes, the endpoint reconciler [4], has landed in 1.9.
More work is pending on the etcd client in Kubernetes. The good news is that the client is used everywhere, so a single fix will benefit all components.
I don't get this. Didn't Kubernetes come out of Google's Borg, which had been in use forever? The second write should be more elegant and impressive -- why so many basic bugs?
Kubernetes takes some concepts from Borg. A system like Borg would be so closely coupled to Google's infrastructure that there's probably very little to open source from it without open sourcing the entire machinery.
Also, any large-scale system like Borg, developed at a large company like Facebook or Google, will have a strongly opinionated, one-way-of-doing-things approach to many aspects. That doesn't work for the outside world, where there are lots of developers from different backgrounds and lots of projects with different requirements.
I think this bit from "Borg, Omega, and Kubernetes"[1] (which is an excellent read) sheds light on this:
> The Borgmaster is a monolithic component that knows the semantics of every API operation. It contains the cluster management logic such as the state machines for jobs, tasks, and machines; and it runs the Paxos-based replicated storage system used to record the master’s state.
So it sounds as though Borg includes its own storage system. As I understand it, Google has a set of (very complex) C++ libraries that implement Paxos/Multi-Paxos [2], which they have not open sourced.
IIRC from one of their talks, K8s was supposed to be Borg 2.0 in many respects. They decided early on in development that it was a good tool with lots of potential, but that "fixing" Borg would be less work than replacing it. So K8s takes the Borg 2.0 concepts without containing any of Borg's code.