Hacker News new | ask | show | jobs
by jmsgwd 38 days ago
> The idea that AWS's services are fully regionalized or isolated has always been a myth.

This is highly misleading. It's true that there's a handful of global AWS services - but only their control planes operate from a single region (e.g. us-east-1). Their data planes are regionally isolated or globally distributed.[1]

The only time you'd normally use a service control plane is to deploy changes, e.g. when you create, read, update or delete service resources or update configuration during a change window.

Workloads should be designed for "static stability", as recommended by AWS.[2] A statically stable workload only depends upon the data planes of the services it uses at runtime. Statically stable workloads are designed to continue operating as normal even if there's a service event impairing one or more control planes (including for global services).

> During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.

This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region [3]. The IAM data plane, which enforces access control, is also regional.

If the IAM control plane is impaired, you might not be able to create new IAM roles (a control plane operation) - but you can continue generating and using temporary credentials for existing IAM roles (data plane operations) within the region your workload is running in. This allows statically stable workloads to continue using IAM without interruption.

[1] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"Global AWS services still follow the conventional AWS design pattern of separating the control plane and data plane in order to achieve static stability. The significant difference for most global services is that their control plane is hosted in a single AWS Region, while their data plane is globally distributed."

[2] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"...eliminating dependencies on control planes (the APIs that implement changes to resources) in your recovery path helps produce more resilient workloads."

[3] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"STS is a data plane-only service that is separate from IAM, and does not depend on the IAM control plane."

1 comments

You're right of course to distinguish the control plane and data plane, and it sounds like you know more about this than I do for IAM.

I disagree, though, that my post was "highly misleading" despite this omission.

As a practical matter, some services fail to achieve the "static stability" you describe, in terms of not depending on other services’ control planes.

And also, many on-calls ops and firefighting tasks (to say nothing of canaries and other automated tests) depend on other services’ control planes.

And above all, many AWS engineers (myself very much included even after years there) don't have a clear understanding of the boundaries of other services’ control planes. https://news.ycombinator.com/item?id=48078254

> > During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.

> This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region.

I didn't mention STS in the service to which you're responding. The service that I worked on the most, RDS, required ssh'ing into live instances to solve basically all non-trivial problems (I'd guess 80% of the tickets that I saw actually resolved required it). And I have no idea if it how STS was involved in generating the ephemeral Midway-signed ssh keys required for it… but whenever there were us-east-1 IAM outages we'd have big problems opening new sessions, while less-capable web-console-based ops tools with long-lived credentials would keep working.