by otterley
228 days ago
> If instead we have a standard protocol for service discovery, you can drop [reconciliation] in place of the AWS internal service discovery system and then clients (both internal and external) can directly resolve the DynamoDB backends without needing a DNS intermediary.
DNS is a service discovery protocol! And a rather robust one, too. Don’t forget that.
AWS doesn’t want to expose to the customer all the dirty details of how internal routing is done. They want to publish a single regional service endpoint, put a SLO on it, and handle all the complexity themselves. Saving unnecessary complexity from customers is, after all, one of the key value propositions of a managed service. It also allows the service provider the flexibility to change the underlying implementation without impacting customer clients.
I’m not sure the best response to “the reconciler had a bug, and other reconcilers might, too” is to replace it with an entirely new and untested service discovery protocol. A proposed compensating control to this bug might be as simple as “if the result would be to delete the zone or empty it of all RRs, halt and page the on-call.” Fail open, as it were.
Also, anyone proposing a new protocol in response to a problem, especially one that had nothing to do with the protocol itself, should probably be burdened with defining and implementing its replacement. ;)
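A rough sketch of what such a guard could look like, purely for illustration. The names `guarded_apply`, `UnsafePlanError`, and `page_oncall` are hypothetical and do not correspond to any real AWS tooling; record sets are modeled as plain sets of strings.

```python
from __future__ import annotations

from typing import Callable


class UnsafePlanError(RuntimeError):
    """Raised when a plan would leave a populated zone with no records."""


def guarded_apply(
    current: set[str],
    desired: set[str],
    page_oncall: Callable[[str], None],
) -> set[str]:
    """Return the record set to publish, refusing to empty a populated zone.

    `page_oncall` is a hypothetical alerting hook supplied by the caller.
    """
    if current and not desired:
        # The plan would remove every record: halt and escalate to a human.
        page_oncall("plan would remove every record from the zone; halting")
        raise UnsafePlanError("refusing to publish an empty record set")
    return desired
```

The check only covers the one failure mode named above (an empty result); it says nothing about plans that are wrong in other ways.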
That is not what I am proposing. The current state is that there are two reconcilers (DNS and internal service discovery) and collapsing those into one reconciler protocol will simplify the system.
> especially one that had nothing to do with the protocol itself
Part of the problem is the added system complexity that comes from layering multiple service discovery systems on top of each other.
> A proposed compensating control to this bug might be as simple as “if the result would be to delete the zone or empty it of all RRs, halt and page the on-call.”
You cannot predict every possible bug and race condition in advance. How can I create alerts for all of the failure conditions I have not thought of? A better assumption is that all systems will fail, and one of the things you can do to reduce the failure rate is to simplify the system. Additionally, you can segment the system into shards/cells and roll out config and code changes serially to each cell to catch issues before they affect 100% of customers.
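To make the cell-by-cell rollout concrete, here is a minimal sketch. The function `rollout` and the callbacks `apply_change` and `is_healthy` are hypothetical names for illustration, not any particular deployment system's API.

```python
from typing import Callable, Sequence


def rollout(
    cells: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Apply a change one cell at a time and stop at the first unhealthy cell.

    Returns the cells that received the change and passed the health check.
    """
    completed: list[str] = []
    for cell in cells:
        apply_change(cell)
        if not is_healthy(cell):
            # Stop here so the remaining cells never receive the bad change.
            break
        completed.append(cell)
    return completed
```

With three cells and a change that breaks the second one, only the first two cells are ever touched, so the blast radius is bounded by a single cell rather than the whole fleet.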
I am not hand-waving or yelling at the clouds here. I have worked on service discovery for hyperscalers and have witnessed similar outages where the impedance mismatch between internal service discovery and DNS caused issues.