by cle 2438 days ago
> Then when a reader connects, instead of connecting directly to the nsqlookupd discovery service, the reader connects to a proxy. The proxy has two jobs. One is to cache lookup requests, but the other is to return only in-zone nsqd instances for zone-aware clients.

> Our forwarders that read from NSQ are then configured as one of these zone-aware clients. We run three copies of the service (one for each zone), and then have each send traffic only to the service in its zone.

Isn't this the default behavior of ELB/NLB to begin with? Why not just configure the zone-aware clients to call zonal LBs, instead of hosting your own LB? Same with Consul. I don't understand what benefit Segment gets from using Consul vs. calling the EC2 Metadata API to discover the AZ and then calling the appropriate zonal LB endpoint. That's not hard to do and avoids many extra dimensions of operational complexity.

It's also unclear to me how all this migration to intra-AZ routing affects Segment's resilience to AZ outages.

2 comments

Consul allows transparent failover to be built in easily. So it can prefer your AZ-local service, but if that becomes unavailable, it can fail over to the next-nearest service, be it in a different AZ or an entirely different region. The direct lookup you describe would not be able to handle failover in an intelligent way. Consul can also provide DNS automatically for your services, route based on network tomography, and the latest versions can provide automatic mTLS between services, and descriptive network security rules. Not to mention providing a handy place to store config state and send events.
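
To make the "prefer AZ-local, then fall over to the next-nearest" ordering concrete, here is a rough sketch in plain Python. It is not Consul's API or how Consul implements this; the `Instance` record, the `pick_instances` helper, and the zone and region values are all made up for illustration, and health is reduced to a single boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical record for one registered service instance.
    address: str
    zone: str
    region: str
    healthy: bool = True


def pick_instances(instances, local_zone, local_region):
    """Return the healthy instances from the nearest tier that has any."""
    healthy = [i for i in instances if i.healthy]
    tiers = (
        # Tier 1: same AZ as the caller.
        [i for i in healthy
         if i.region == local_region and i.zone == local_zone],
        # Tier 2: same region, different AZ.
        [i for i in healthy
         if i.region == local_region and i.zone != local_zone],
        # Tier 3: anything healthy in another region.
        [i for i in healthy if i.region != local_region],
    )
    for tier in tiers:
        if tier:
            return tier
    return []  # nothing healthy anywhere
```

A client pinned to one zonal LB endpoint effectively only ever has tier 1, which is the failover gap being described.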

Beyond that, ELBs have a significant cost if you are running multiple for each internal service you might have, and the API is slow and cumbersome compared to dealing with Consul's service-centric API. From an operations POV, Consul's ACL system is also a lot more flexible than what AWS IAM can provide. So you can be sure your services are limited in what they can claim to be and what gets set up on their behalf. Whereas if you want to automate creation and configuration of ELBs, you are going to have to either grant more access than you really want or you'll have to abstract that behind another service that you have to write.

As for AZ outages... in practice, a cross-AZ system is often just as vulnerable to problems from the outage of a particular AZ, especially if any autoscaling is involved. AWS's tools around this are severely lacking, despite what they tell us about resiliency best practices. But it all depends on the architecture and mostly the data layer.

> As for AZ outages... in practice, a cross-AZ system is often just as vulnerable to problems from the outage of a particular AZ, especially if any autoscaling is involved.

If a system is not resilient to an outage of a particular AZ, by definition I would not call it a 'cross-AZ system'. Maybe what you have in mind is systems that in practice _think_ they are cross-AZ resilient but are actually not when you look closer?

The EC2 Metadata API isn’t meant for high-throughput calls, so it’s possible to hit rate limits even from moderate polling once you get enough nodes involved.

Why do you have to constantly poll it? It's once on startup to discover the zone it's running in.

The AZ for an instance is fixed. Check it at startup time, and cache it in-memory.
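
For what it's worth, the "check once, cache in memory" approach might look something like the sketch below. It assumes IMDSv2 (a session token via `PUT /latest/api/token`, then a read of `placement/availability-zone`). The helper names `_fetch_zone_from_imds` and `make_zone_lookup`, the timeout, and the token TTL are illustrative choices, not anything from the article.

```python
import functools
import urllib.request

IMDS = "http://169.254.169.254/latest"


def _fetch_zone_from_imds(timeout=2.0):
    # IMDSv2: get a short-lived session token first.
    token_req = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    # Then read the instance's availability zone using that token.
    zone_req = urllib.request.Request(
        IMDS + "/meta-data/placement/availability-zone",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(zone_req, timeout=timeout) as resp:
        return resp.read().decode().strip()


def make_zone_lookup(fetch=_fetch_zone_from_imds):
    """Return a zero-argument function that fetches the zone once, then reuses it."""

    @functools.lru_cache(maxsize=None)
    def availability_zone():
        # Only a successful result is cached; if fetch() raises,
        # the next call tries again.
        return fetch()

    return availability_zone
```

Build the lookup once at startup (`availability_zone = make_zone_lookup()`) and call `availability_zone()` wherever the zone is needed. After the first successful call the metadata service is not contacted again, so rate limits should not come into play regardless of node count.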