
Mainly just years of experience at different companies watching ZooKeeper (ZK), etcd, and similar systems fall over at scale and require teams of people to maintain them.

We have had zero outages caused by our eventually consistent discovery system with active health checking (knock on wood), and haven't really touched the discovery service code in months. It just runs.

I'm not saying that a system built on ZK or the like can't be made to work. It certainly can, since many companies do it. I just think those solutions make the overall problem a lot more complicated and failure-prone than it has to be.