> It was a latent bug in the control plane that updated the records, not the data plane

Yes, I know that. But part of the issue is that the control plane exists in the first place to smooth over the impedance mismatch between DNS and how dynamic service discovery works in practice. If we had a protocol that handled dynamic service discovery better, the control plane would be much less complex and less prone to bugs.

As far as I have seen, most cloud providers use their own service discovery systems internally and then layer DNS on top for third-party clients. For example, DynamoDB is registered in AWS's internal service discovery system, and the control plane is responsible for reconciling that service discovery state into DNS (the part which had a bug). If we instead had a standard protocol for service discovery, you could drop it in place of the AWS internal system, and clients (both internal and external) could resolve the DynamoDB backends directly without a DNS intermediary.

I don’t know how AWS or DynamoDB works in practice, but I have worked at other hyperscalers where a similar setup exists (DNS is layered on top of some internal service discovery system).

> If you’re going to replace DNS, you’re going to have a steep hill to climb.

Yes, no doubt. But as we have seen with WireGuard, a good idea with real merit can be adopted quickly across a wide range of operating systems and libraries.
DNS is a service discovery protocol! And a rather robust one, too. Don’t forget that.
AWS doesn’t want to expose to the customer all the dirty details of how internal routing is done. They want to publish a single regional service endpoint, put an SLO on it, and handle all the complexity themselves. Sparing customers unnecessary complexity is, after all, one of the key value propositions of a managed service. It also gives the service provider the flexibility to change the underlying implementation without impacting customer clients.
I’m not sure the best response to “the reconciler had a bug, and other reconcilers might, too” is to replace it with an entirely new and untested service discovery protocol. A compensating control for this bug might be as simple as “if the result would be to delete the zone or empty it of all RRs, halt and page the on-call.” Fail open, as it were.
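To make that guard concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: `Record`, `reconcile`, and the injected `page` callback are made-up names, not anything from AWS or a real DNS library. It only shows the shape of the check: a plan that would delete or empty the zone is refused, the on-call is paged, and the last known-good records keep being served.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a DNS resource record (RR)."""
    name: str
    rtype: str
    value: str


def reconcile(
    current: set[Record],
    desired: Optional[set[Record]],
    page: Callable[[str], None],
) -> set[Record]:
    """Return the record set that should be published.

    `desired` is None when the plan would delete the zone outright,
    and an empty set when it would strip the zone of all RRs.
    Either case halts the update and pages a human.
    """
    if not desired:  # True for both None and the empty set
        page("reconciler plan would delete or empty the zone; update halted")
        return current  # keep serving the last known-good records
    return set(desired)
```

With a live zone holding one A record, a plan containing a replacement record is applied as usual, while a plan of `set()` or `None` leaves the zone untouched and calls `page` once per attempt. Injecting `page` as a callback is just a convenience for testing; a real reconciler would wire in whatever alerting it already uses.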
Also, anyone proposing a new protocol in response to a problem—especially one that had nothing to do with the protocol itself—should probably be burdened with defining and implementing its replacement. ;)