| I haven’t personally worked on Envoy xDS, but it is what I have seen several BigCos use for routing from the edge to internal applications. > Running consensus transcontinentally is very painful You don’t necessarily have to do that: you can keep your quorum nodes (let’s assume we are talking about etcd) far enough apart to be in separate failure domains (fires, power loss, natural disasters) but close enough that network latency between the replicas isn’t unbearably high. I have seen the following scheme work for millions of workloads: 1. An etcd quorum across 3 close but independent regions. 2. On startup, the app registers itself under a prefix shared by all other replicas of that app. 3. All clients of that app issue etcd watches on that prefix and are notified almost instantly when there is a change. This is baked in as a plugin within gRPC clients. 4. A custom gRPC resolver is used to do lookups by service name. |
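For anyone who wants the shape of the quoted scheme in code, here is a minimal in-memory sketch in Python. It is not the etcd or gRPC API: `PrefixRegistry`, `ServiceResolver`, and the `/services/<name>/` key layout are hypothetical names made up for illustration, and the "watch" is a plain callback rather than a network stream.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Callback signature: (event, key, value), where event is "PUT" or "DELETE".
WatchCallback = Callable[[str, str, str], None]


class PrefixRegistry:
    """In-memory stand-in for a key-value store with prefix watches (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[WatchCallback]] = defaultdict(list)

    def put(self, key: str, value: str) -> None:
        # Step 2 of the scheme: an app replica registers itself under a prefix.
        self._data[key] = value
        self._notify("PUT", key, value)

    def delete(self, key: str) -> None:
        # Deregistration; watchers only hear about keys that actually existed.
        value = self._data.pop(key, None)
        if value is not None:
            self._notify("DELETE", key, value)

    def get_prefix(self, prefix: str) -> Dict[str, str]:
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}

    def watch_prefix(self, prefix: str, callback: WatchCallback) -> None:
        # Step 3 of the scheme: clients subscribe to changes under a prefix.
        self._watchers[prefix].append(callback)

    def _notify(self, event: str, key: str, value: str) -> None:
        for prefix, callbacks in list(self._watchers.items()):
            if key.startswith(prefix):
                for callback in callbacks:
                    callback(event, key, value)


class ServiceResolver:
    """Step 4 of the scheme: resolve a service name to its live addresses."""

    def __init__(self, registry: PrefixRegistry, service: str) -> None:
        prefix = f"/services/{service}/"  # assumed key layout
        # Seed from the current state, then keep the view fresh via the watch.
        self._by_key: Dict[str, str] = dict(registry.get_prefix(prefix))
        registry.watch_prefix(prefix, self._on_event)

    def _on_event(self, event: str, key: str, value: str) -> None:
        if event == "PUT":
            self._by_key[key] = value
        else:
            self._by_key.pop(key, None)

    def addresses(self) -> List[str]:
        return sorted(self._by_key.values())
```

In a real etcd deployment the registration key would typically be attached to a lease, so that it expires on its own if the app dies without deregistering. The sketch leaves that out.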
Two other details that are super important here:
The first is that this is a public cloud. There is no real correlation between apps/regions and clients. Clients are public Internet users. When you bring an app up, it just needs to work, for completely random browsers on completely random continents. Users can and do move their instances (or, more likely, reallocate instances) between regions with no notice.
The second detail is that no matter what DX compromise you make to scale global consensus up, you still need reliable real-time updates when instances go down. Not knowing about a new instance that just came up isn't that big a deal! You just get less optimal routing for the request. Not knowing that an instance went down is a very big deal: you end up routing requests to dead instances.
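To make that asymmetry concrete, here is a toy Python sketch. The `route` function, the region names, and the latency numbers are all invented for illustration, and this is not how any real router picks a target. It only shows that a stale view missing a new instance still produces a working (slower) route, while a stale view containing a dead instance produces a failed request.

```python
from typing import Dict, Set


def route(view: Dict[str, int], alive: Set[str]) -> str:
    """Pick the lowest-latency instance in the router's view and report the outcome.

    `view` maps instance name to estimated latency in ms as the router believes it;
    `alive` is the ground truth the router cannot see directly.
    """
    if not view:
        return "no-route"
    target = min(view, key=lambda name: view[name])
    return f"ok:{target}" if target in alive else f"failed:{target}"


# Case 1: a new instance "fra" came up, but the router has not heard yet.
# The request still succeeds, just against the farther instance.
missed_up = route(view={"ord": 110}, alive={"ord", "fra"})

# Case 2: instance "fra" went down, but the router has not heard yet.
# The request is sent to a dead instance.
missed_down = route(view={"fra": 20, "ord": 110}, alive={"ord"})
```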
The deployment strategy you're describing is in fact what we used to do! We had a Consul cluster in North America and ran the global network off it.