by edutechnion 3932 days ago
I went down a similar road with etcd and fleet but abandoned it earlier this summer after testing failure scenarios with etcd. With a cluster of 5 etcd nodes in EC2, I started hard-killing etcd EC2 instances and noticed fleet inconsistency (e.g., nodes being restarted, not able to see the entire fleet).

Can you expand on the etcd growth pains you've been through?

2 comments

The basis of this was being pointed in the right direction by the community.

etcd had a HUGE issue with the implementation of the Raft consensus algorithm they were using. This was in version 0.x.

The tough part was that, even though etcd 2.0 was released in January [1], it was not put into CoreOS alpha until April [2].

After moving to 2.x, all my problems went away. There was a small learning curve in setting up lots of nodes in the cluster vs. proxies [3]. 2.x added a lot of functionality, but the main thing for us was its reliability: being able to query the status of members, add/remove members from the cluster, and monitor it.
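
To make "query the status of members" concrete, here is a minimal sketch that lists cluster members over the etcd 2.x HTTP API (GET /v2/members). The endpoint address, the helper names (parse_members, fetch_members) and the sample response are assumptions for illustration, not anything from our actual setup.

```python
import json
from urllib.request import urlopen


def parse_members(body):
    """Turn a /v2/members JSON response body into a list of small dicts."""
    members = json.loads(body).get("members", [])
    return [
        {
            "id": m["id"],
            "name": m.get("name", ""),
            "client_urls": m.get("clientURLs", []),
        }
        for m in members
    ]


def fetch_members(endpoint="http://127.0.0.1:2379"):
    """Query a running etcd 2.x node (endpoint is an assumed local address)."""
    with urlopen(endpoint + "/v2/members", timeout=5) as resp:
        return parse_members(resp.read().decode("utf-8"))


# Hypothetical response body, shaped like the v2 members API output.
SAMPLE = """
{"members": [
  {"id": "272e204152", "name": "infra1",
   "peerURLs": ["http://10.0.0.10:2380"],
   "clientURLs": ["http://10.0.0.10:2379"]},
  {"id": "2225373f43", "name": "infra2",
   "peerURLs": ["http://10.0.0.11:2380"],
   "clientURLs": ["http://10.0.0.11:2379"]}
]}
"""

if __name__ == "__main__":
    # Parse the sample offline; call fetch_members() against a live node instead.
    for member in parse_members(SAMPLE):
        print(member["name"], member["id"], ",".join(member["client_urls"]))
```

The parsing is kept separate from the HTTP call so it can be checked without a live cluster.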

Before etcd 2.x, the whole etcd infrastructure would die (and consequently, fleet) if just ONE node restarted. Needless to say, it's come a long way.
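
For context on why one restart taking everything down was a bug rather than expected behaviour: Raft only needs a majority of members to be available, so a 5-node cluster like the one mentioned above should ride out two failures. A quick sketch of the arithmetic (the function names are just illustrative):

```python
def quorum(n):
    """Smallest majority of an n-member cluster."""
    return n // 2 + 1


def tolerated_failures(n):
    """Members that can be lost while a majority is still reachable."""
    return n - quorum(n)


if __name__ == "__main__":
    for size in (1, 3, 5, 7):
        print(size, "members: quorum", quorum(size),
              "tolerates", tolerated_failures(size), "failures")
```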

We've been running etcd 2.x in a container [4] since January, then just pointing fleetctl at it with export FLEETCTL_ENDPOINT=http://127.0.0.1:2379

[1] - https://coreos.com/blog/etcd-2.0-release-first-major-stable-...

[2] - https://coreos.com/blog/coreos-alpha-with-etcd-2/

[3] - https://coreos.com/etcd/docs/latest/admin_guide.html

[4] - https://coreos.com/blog/Running-etcd-in-Containers/

I would definitely recommend that you reevaluate it with a newer version of etcd, as it has had some significant stability improvements post-2.0. I've been doing some fuzz testing of it lately and found that it has gotten much more reliable.