by doublerebel 3567 days ago
Thanks for all the info. Any insight into the service discovery issues described in the docs [1]?

  Many existing RPC systems treat service discovery as a
  fully consistent process. To this end, they use fully
  consistent leader election backing stores such as
  Zookeeper, etcd, Consul, etc. Our experience has been
  that operating these backing stores at scale is painful.
[1]: https://lyft.github.io/envoy/docs/intro/arch_overview/servic...

Mainly just years of experience at different companies watching ZK, etcd, etc. fall over at scale and require teams of people to maintain them.

We have had zero outages caused by our eventually consistent discovery system with active health checking (knock on wood), and haven't really touched the discovery service code in months. It just runs.

I'm not saying that a system using ZK, etc. can't be made to work. It certainly can, since many companies do it. It's mostly that I think those solutions actually make the overall problem a lot more complicated and prone to failure than it needs to be.
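The reply contrasts a fully consistent registry with an eventually consistent discovery list backed by active health checking. Below is a minimal Python sketch of how a client could merge those two signals. It assumes a rule where a passing health check overrides a stale or missing discovery entry, and where a host is forgotten only when it is both absent from discovery and failing checks. The names `decide`, `HostSet`, and `Action`, and the exact policy, are hypothetical illustrations, not Envoy's code or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ROUTE = "route"
    DONT_ROUTE = "dont_route"
    DELETE = "delete"


def decide(discovered: bool, healthy: bool) -> Action:
    """Hypothetical merge rule: discovery may be stale, so active health checks win."""
    if healthy:
        # Route even if the host dropped out of discovery; the registry may
        # simply be lagging behind reality (eventual consistency).
        return Action.ROUTE
    if discovered:
        # Registry still lists the host but it fails checks: keep tracking it
        # so it can recover, but send it no traffic.
        return Action.DONT_ROUTE
    # Absent from discovery AND failing checks: safe to forget.
    return Action.DELETE


@dataclass
class HostSet:
    # address -> result of the most recent active health check
    hosts: dict[str, bool] = field(default_factory=dict)

    def update(self, discovered: set[str], health: dict[str, bool]) -> list[str]:
        """Apply one discovery refresh plus the latest health results.

        Returns the sorted list of routable addresses. A host with no health
        result yet is treated as unhealthy (assumption: check before routing).
        """
        routable: list[str] = []
        for addr in sorted(set(self.hosts) | discovered):
            healthy = health.get(addr, False)
            action = decide(addr in discovered, healthy)
            if action is Action.DELETE:
                self.hosts.pop(addr, None)
                continue
            self.hosts[addr] = healthy
            if action is Action.ROUTE:
                routable.append(addr)
        return routable
```

The point of the design is that a slow or flapping discovery backend can only delay adding or removing hosts; it cannot by itself cause healthy hosts to be dropped, which is the failure mode a fully consistent store tends to amplify.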