| "If this happened our cluster would become unavailable and may have trouble re-clustering." This was basically my repeated experience, and it caused me to abandon etcd for the time being. If it can barely ever heal, what the fuck good is it? And I found that it could barely ever heal. A 3-node CoreOS cluster I ran _always_ crashed when it attempted a coordinated update, and could only rarely be repaired, even with hours of help from #CoreOS. Because CoreOS pushes out updates with versions of etcd incompatible with other recent versions, the etcd cluster could never survive the upgrade. Add to this the fact that the CEO of CoreOS told me in person that he expected them to be the _only_ operating system on the internet, and I'm generally not along for the ride with CoreOS any longer. Consul, Mesos, and Docker are looking good. Anyone interested in this space should check out: https://github.com/CiscoCloud/microservices-infrastructure
|
But I also handle upgrading releases differently. Automatic updates are not something I trusted from the beginning, and it's easy enough to disable their update system and stand up new instances with upgraded CoreOS images.
Also, I would consider your quote very out of context. The sentence right before it reads:
"If there were any changes to these etcd machines, AWS would reboot them to apply the changes, potentially all at the same time."
So they had CloudFormation potentially rebooting all their machines at the same time. I think any cluster is going to have an issue when that happens, and it really has nothing to do with CoreOS's update system.